Term Weight Algorithm Oriented Terms: Low Frequency Rather Than Little Occurrences
Authors: | Yanhuang Jiang, Yuhong Huang, Shijie Li, Tiejun Li, Yiyi He |
---|---|
Publication year: | 2020 |
Subject: |
Logarithm; Computer science; A little better; 020206 networking & telecommunications; 02 engineering and technology; Low frequency; Expression (mathematics); Term (time); 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Algorithm; General Environmental Science |
Source: | KES |
ISSN: | 1877-0509 |
DOI: | 10.1016/j.procs.2020.09.079 |
Description: | Term weight algorithms based on inverse document analysis are widely used to express the characteristic information of text. Because frequently occurring terms tend to carry less feature information about a text, terms with lower frequency are given higher weight. However, terms with very few occurrences, such as rare or misspelled terms, often carry unimportant or even erroneous information. To tackle this problem, this paper proposes a novel term weight algorithm that focuses on terms with low frequency rather than few occurrences. Statistics based on non-homogeneous compression of term frequency highlight the contribution of terms in the frequency range of interest. A logarithmic function, combined with the number of terms sharing the same frequency, is then used to weight terms of different frequencies according to their compression intervals. Compared with TF-IDF and SIF, the proposed approach performs similarly to SIF and slightly better than TF-IDF. The differences among these methods suggest that terms with low frequency, rather than few occurrences, may dominate the feature information of the text. |
Database: | OpenAIRE |
External link: |
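
Note: the problem the description raises with inverse-document weighting can be illustrated with a minimal sketch of the standard IDF formula, idf = log(N / df). This is the textbook baseline only, not the algorithm proposed in the paper, whose compression intervals are not specified in this record. The function name `idf_weights` and the toy corpus are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: idf} using the plain formula idf = log(N / df).

    Hypothetical helper for illustration; not taken from the paper.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy corpus (an assumption); "catt" is a deliberate misspelling of "cat".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the catt slept",
]
weights = idf_weights(docs)

# Print terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus the misspelled term "catt" appears in a single document and so receives the largest weight, above the correctly spelled "cat", while "the" appears in every document and receives zero. This is the behavior the description identifies as a weakness of weighting by rarity alone.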