Application for automated statistical language model construction by processing language corpus

Author: Kovačić, Tomislav
Contributors: Banek, Marko
Language: Croatian
Year of publication: 2015
Subject:
Description: This master's thesis gives an insight into the problem of predicting word sequences in natural language processing. A statistical language model is based on n-grams and is created by processing a language corpus, i.e. by counting words and word sequences. The language model assumes the existence of Markov properties among consecutive words of a natural language. Due to the data sparsity of the corpus, a problem of zero-probability n-grams arises, caused by their non-appearance in the language corpus.
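The counting procedure described above can be sketched in a few lines. This is a minimal illustration; the function name and tiny tokenized corpus are invented for the example and are not taken from the thesis:

```python
from collections import Counter

def bigram_mle(corpus_tokens):
    """Estimate bigram probabilities P(w2 | w1) by maximum likelihood,
    i.e. by counting words and word pairs in the corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
# A bigram never seen in the corpus, e.g. ("cat", "on"), is simply absent
# and would receive probability zero -- the sparsity problem that the
# smoothing techniques below address.
```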
To alleviate this problem, smoothing techniques are used, which assign a non-zero probability to such n-grams based on estimates of the frequencies of the existing n-grams. Many smoothing algorithms have been developed for this purpose, some of which are described in this thesis. Some algorithms are not usually used on their own but are paired with other smoothing algorithms. Depending on the implementation, the algorithms can run into problems while smoothing, and each has certain advantages and disadvantages, so they are chosen according to the type of problem and the field of application. To determine the most probable word sequence given the input data and the created language model, the Viterbi algorithm is used, which returns the most probable state sequence of a hidden Markov model. To compare the quality of language models, the measure of perplexity is most frequently used. In this thesis, the best model proved to be the one built with the Kneser-Ney smoothing algorithm.
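As a minimal illustration of Viterbi decoding over a hidden Markov model, the following sketch uses a classic toy weather model; the states, observations, and probabilities are invented for the example and are not the thesis's model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence of an HMM
    for the given observation sequence."""
    # best[s] = probability of the best path so far ending in state s
    best = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_best, new_path = {}, {}
        for s in states:
            # Extend the best previous path with a transition into s
            prev = max(states, key=lambda p: best[p] * trans_p[p][s])
            new_best[s] = best[prev] * trans_p[prev][s] * emit_p[s][o]
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    final = max(states, key=lambda s: best[s])
    return path[final]

# Toy HMM parameters (illustrative values only)
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
decoded = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
```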
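Perplexity can be sketched as the exponential of the negative average log-probability of a test sequence. Add-one (Laplace) smoothing is used here only as a simple stand-in, since the thesis's best results came from the more elaborate Kneser-Ney algorithm; the corpora below are invented for the example:

```python
import math
from collections import Counter

def perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model on a test sequence."""
    vocab = set(train_tokens) | set(test_tokens)
    V = len(vocab)
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        # Smoothed estimate: (count(w1, w2) + 1) / (count(w1) + V),
        # so unseen bigrams receive a small non-zero probability.
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
        log_prob += math.log(p)
        n += 1
    # Perplexity = exp(-average log-probability); lower is better
    return math.exp(-log_prob / n)

train = "the cat sat on the mat".split()
ppl_seen = perplexity(train, train)
ppl_unseen = perplexity(train, "the dog ran away".split())
```

A sequence resembling the training corpus yields a lower perplexity than an unrelated one, which is how the thesis compares the models built with different smoothing algorithms.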
Database: OpenAIRE