Parallel identification of the spelling variants in corpora

Author: Reynaert, M.W.C., Lopresti, D., Roy, S., Schulz, K., Venkata Subramaniam, L.
Contributors: Creative Computing
Language: English
Year of publication: 2009
Subject:
Source: Proceedings of the Third workshop on Analytics for Noisy Unstructured Text Data 2009, 77-84
Description: We present a new approach, based on anagram hashing, to handling typographical variation globally in large and possibly noisy text collections. Typographical variation is typically handled locally: given one particular text string, some system for retrieving near-neighbours is applied, where near-neighbours are other text strings that differ from the given string by a given number of characters. We call the difference in characters between the original string and one of its retrieved near-neighbours a particular character confusion. We present a global way of performing this action: given a possible character confusion, we identify in parallel (i.e. in a single operation on anagram-hash-derived bit vectors) all pairs of text strings in the collection to which that confusion applies. The proposed algorithm is evaluated on about 23,000 attested English typos from the Reuters RCV1 text collection. We further explore its usefulness for unsupervised linking of a historical Dutch word list to its contemporary counterpart.
Database: OpenAIRE
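
The global lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The hash function (sum of each character's code point raised to the power 5), the use of Python sets as a stand-in for the bit vectors, and the names `anagram_value`, `confusion_value` and `pairs_for_confusion` are all assumptions made for illustration, since the abstract specifies none of them.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the abstract does not specify the hash function


def anagram_value(word: str) -> int:
    """Order-independent hash: all anagrams of a word share one value."""
    return sum(ord(ch) ** POWER for ch in word)


def confusion_value(deleted: str, inserted: str) -> int:
    """Change in anagram value when `deleted` characters become `inserted`."""
    return anagram_value(inserted) - anagram_value(deleted)


def pairs_for_confusion(words, deleted, inserted):
    """Find every (source, target) pair whose anagram values differ by the confusion.

    A set intersection stands in for the single bit-vector operation
    mentioned in the abstract (an assumption of this sketch).
    """
    index = defaultdict(set)
    for w in words:
        index[anagram_value(w)].add(w)

    delta = confusion_value(deleted, inserted)
    # Shift every attested value by the confusion, then intersect with the
    # attested values: each hit is a target value reachable from some source.
    shifted = {value + delta for value in index}
    hits = shifted & index.keys()

    pairs = set()
    for target_value in hits:
        for src in index[target_value - delta]:
            for tgt in index[target_value]:
                pairs.add((src, tgt))
    return sorted(pairs)


if __name__ == "__main__":
    corpus = ["separate", "seperate", "relevant", "relevent", "government"]
    for src, tgt in pairs_for_confusion(corpus, "e", "a"):
        print(src, "->", tgt)
```

Because the hash ignores character order, a match on anagram values only yields candidate pairs. A practical system would presumably confirm each candidate with a string-level check such as an edit-distance bound.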