Parallel identification of the spelling variants in corpora
Author: | Reynaert, M.W.C., Lopresti, D., Roy, S., Schulz, K., Venkata Subramaniam, L. |
---|---|
Contributors: | Creative Computing |
Language: | English |
Year of publication: | 2009 |
Subject: |
Information retrieval
Anagram; Computer science; Character (computing); business.industry; String (computer science); Hash function; ComputingMilieux_LEGALASPECTSOFCOMPUTING; Noisy text; computer.software_genre; Spelling; Identification (information); Variation (linguistics); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Artificial intelligence; business; computer; Natural language processing |
Source: | Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data 2009, 77-84 |
Description: | We present a new approach based on anagram hashing to globally handle the typographical variation in large and possibly noisy text collections. Typographical variation is typically handled locally: given one particular text string, some system for retrieving near-neighbours is applied, where near-neighbours are other text strings that differ from the given string by a given number of characters. We call the difference in characters between the original string and one of its retrieved near-neighbours a particular character confusion. We present a global way of performing this action: given a possible particular character confusion, we identify in parallel, i.e. in one single operation on anagram-hash derived bit vectors, all the pairs of text strings in the text collection to which the particular confusion applies. The algorithm proposed here is evaluated on about 23,000 attested English typos from the Reuters RCV1 text collection. We further explore its usefulness for unsupervised linking of a historical Dutch word list to its contemporary counterpart. |
Database: | OpenAIRE |
External link: |
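
The global lookup described in the abstract can be illustrated with a minimal sketch. It rests on assumptions not stated in this record: the anagram value of a string is taken to be the sum of its character code points raised to a fixed power (5 here), and a plain dictionary lookup stands in for the bit-vector operation the paper uses. All function names are hypothetical.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the record does not specify the hash function


def anagram_value(s: str) -> int:
    """Order-independent hash: all anagrams of s share this value."""
    return sum(ord(ch) ** POWER for ch in s)


def build_index(words):
    """Map each anagram value to the set of corpus words that have it."""
    index = defaultdict(set)
    for w in words:
        index[anagram_value(w)].add(w)
    return index


def confusion_value(removed: str, added: str) -> int:
    """Numeric value of a character confusion (characters removed vs. added)."""
    return anagram_value(added) - anagram_value(removed)


def pairs_for_confusion(index, removed: str, added: str):
    """Find all candidate word pairs to which one confusion applies.

    A dictionary lookup replaces the bit-vector operation here.
    """
    delta = confusion_value(removed, added)
    pairs = []
    for value, sources in index.items():
        targets = index.get(value + delta)
        if not targets:
            continue
        for s in sources:
            for t in targets:
                if s != t:
                    pairs.append((s, t))
    return sorted(pairs)


words = ["separate", "seperate", "definite", "definate", "cat"]
index = build_index(words)
print(pairs_for_confusion(index, "a", "e"))  # → [('separate', 'seperate')]
```

Because the hash ignores character order, the pairs returned are only candidates; a real system would presumably filter them, for example with an edit-distance check.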