StyloThai
Author: | Raheem Sarwar, Attapol T. Rutherford, Sarana Nutanong, Thanasarn Porthaveepong, Thanawin Rakthanmanon |
---|---|
Year of publication: | 2020 |
Subject: |
General Computer Science
Computer science; Nearest neighbor search; Feature vector; Set (abstract data type); Identification (information); Classifier (linguistics); Outlier; Feature (machine learning); Stylometry; Artificial intelligence; Natural language processing; 05 social sciences; 0509 other social sciences; 050904 information & library sciences; 050905 science studies |
Source: | ACM Transactions on Asian and Low-Resource Language Information Processing. 19:1-15 |
ISSN: | 2375-4702 2375-4699 |
DOI: | 10.1145/3365832 |
Description: | Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. Applications of this task can be found in several domains, such as law enforcement and information retrieval. These application domains are not limited to a specific language, community, or ethnicity. However, most of the existing solutions are designed for English, and little attention has been paid to Thai. These existing solutions are not directly applicable to Thai due to the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale when the size of the candidate author set increases, and (iii) perform well when the number of writing samples for each candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on our feature space, we present an authorship identification solution that uses the probabilistic k-nearest neighbors classifier by transforming each document into a collection of point sets. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution can overcome the limitations of the existing solution and outperforms all competitors with an accuracy of 91.02%. Moreover, we investigate the effectiveness of each category of stylometric features with the help of an ablation study. We found that combining all categories of stylometric features outperforms the other combinations.
Finally, we cross-compare the feature spaces and classification methods of all solutions. We found that (i) our solution can scale as the number of candidate authors increases, (ii) our method outperforms all the competitors, and (iii) our feature space provides better performance than the feature space used by the existing study. |
Database: | OpenAIRE |
External link: |
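
The description above mentions representing each document as a point set, comparing sets with a distance measure that tolerates outliers, and classifying with a probabilistic k-nearest neighbors rule. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the two-dimensional toy feature vectors, the function names, the `keep` parameter, and the choice of a partial (quantile-based) Hausdorff distance as the outlier-tolerant set distance. The record does not say which set distance or which features the paper actually uses.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def directed_partial_distance(set_a, set_b, keep):
    """For each point in set_a, find the distance to its nearest point in set_b,
    then return the value at the `keep` quantile rather than the maximum, so
    that the most distant (outlying) points are ignored. keep=1.0 gives the
    ordinary directed Hausdorff distance."""
    nearest = sorted(min(euclidean(a, b) for b in set_b) for a in set_a)
    index = max(0, math.ceil(keep * len(nearest)) - 1)
    return nearest[index]


def set_distance(set_a, set_b, keep=0.6):
    """Symmetric partial Hausdorff distance (an assumed, illustrative choice)."""
    return max(
        directed_partial_distance(set_a, set_b, keep),
        directed_partial_distance(set_b, set_a, keep),
    )


def predict_proba(query, training, k=3, keep=0.6):
    """training is a list of (author, point_set) pairs. Returns, for each author
    among the k nearest point sets, that author's share of those neighbors."""
    ranked = sorted(training, key=lambda item: set_distance(query, item[1], keep))
    top = ranked[:k]
    votes = Counter(author for author, _ in top)
    return {author: count / len(top) for author, count in votes.items()}


# Toy data: each document is a set of hypothetical 2-D stylometric vectors,
# for example one vector per text fragment.
training = [
    ("author_a", [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]),
    ("author_a", [(0.1, 0.1), (0.0, 0.2), (0.2, 0.0)]),
    ("author_b", [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]),
    ("author_b", [(5.2, 5.1), (4.9, 5.0), (5.0, 4.8)]),
]

# The query resembles author_a except for one outlying fragment at (9, 9).
query = [(0.05, 0.05), (0.1, 0.1), (9.0, 9.0)]

proba = predict_proba(query, training, k=3, keep=0.6)
best_author = max(proba, key=proba.get)
print(best_author, proba)
```

In this sketch the quantile cut-off is what plays the role of the outlier handling mechanism: with `keep=1.0` the single outlying fragment dominates the distance, while a lower value lets the remaining fragments decide. Returning neighbor shares rather than a single label is one simple way to obtain several ranked predictions for a query document.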