Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference Resolution
Author: | Bela Gipp, Felix Hamborg, Anastasia Zhukova |
---|---|
Publication year: | 2020 |
Subject: | Topic model; Coreference; Chemical physics; Computer science; Usability; Engineering and technology; Resolution (logic); Natural sciences; Automatic summarization; Physical sciences; Electrical engineering, electronic engineering, information engineering; Identity (object-oriented programming); Artificial intelligence & image processing; Use case; Artificial intelligence; Composition (language); Natural language processing |
Source: | JCDL |
DOI: | 10.1145/3383583.3398564 |
Description: | Topic modeling is a technique used in a broad spectrum of use cases, such as data exploration, summarization, and classification. Although topic modeling is a crucial constituent of many use cases, established topic models, such as LDA, often produce statistically valid yet non-meaningful topics, i.e., topics that humans cannot easily interpret. As a result, the usability of topic modeling approaches, e.g., in document summarization, is suboptimal. We propose a topic modeling approach that uses TCA, a method for cross-document coreference resolution that also covers near-identity relations. TCA showed promising results when resolving mentions of not only persons and other named entities, but also broad, vague, or abstract concepts. In a preliminary evaluation on news articles, we compare the approach with state-of-the-art topic modeling. We find that (1) the four baselines produce statistically valid yet hollow topics, or topics that refer only to events in the dataset but not to the events' topical composition, and (2) TCA is the only approach that extracts topics that distinctively describe meaningful parts of the dataset. |
Database: | OpenAIRE |
External link: |