Leveraging Topic Models to Develop Metrics for Evaluating the Quality of Narrative Threads Extracted from News Stories
Authors: | Naren Ramakrishnan, Luis Asencios Reynoso, Jason Schlachter, Sathappan Muthiah, Alicia Ruvinsky |
---|---|
Year of publication: | 2015 |
Subject: |
Structure (mathematical logic); Topic model; Computer science; Text analytics; Data science; Industrial and Manufacturing Engineering; Topic modeling; Narrative; Sensemaking; Artificial Intelligence; Analytics; Data analytics; Machine learning; Narrative structure; CLARITY; Domain knowledge; Quality (business) |
Source: | Procedia Manufacturing. 3:4028-4035 |
ISSN: | 2351-9789 |
DOI: | 10.1016/j.promfg.2015.07.972 |
Description: | Analysts and software systems are increasingly tasked with making sense of a growing amount of data to help their organizations make decisions involving risk and uncertainty. A key enabler of this work is the ability to quickly discover structure in large amounts of text such as news stories and blogs. Recent work in this area has shown that it is possible to automatically link documents from a corpus into a narrative structure, called a story chain, without prior domain knowledge [1]. This approach is unsupervised and discovers large numbers of story chains of variable quality. In this paper, we describe and evaluate methods to identify the most coherent and informative story chains. We explore two types of topic-model-based analytics. The first is a measure of representativeness that captures how well a story chain represents the corpus from which it was generated. It compares the topics found over time in a story chain with the topics expressed in the corpus during the same time period. Our hypothesis is that story chains whose topic expression is similar to that of the corpus will convey narratives that are central to the corpus. This type of analytic could help an analyst quickly focus on the key narratives in a large corpus of documents. The second is a measure of story chain quality, composed of topic consistency and topic persistence measures. Our hypothesis is that high-quality chains are composed of sequences of stories with clearly defined primary topics that persist across significant portions of the story chain. We used these analytics to predict the clarity of story chains within one of four categories: (1) very clear narrative, (2) somewhat clear narrative, (3) somewhat unclear narrative, and (4) very unclear narrative. We found that we could train a model that assigned story chains the same label as human coders 77% of the time.
Our dataset was composed of 7,074 English-language news stories published during the Brazil protests of 2013, from which 5,606 story chains were generated. We randomly selected 60 story chains for hand scoring to serve as our gold-standard data set for experimentation. |
Database: | OpenAIRE |
External link: |
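
The abstract names a representativeness measure and two quality measures (topic consistency and topic persistence) but gives no formulas. The sketch below is one possible reading, not the authors' method. It assumes each story has already been reduced to a topic distribution by some topic model, that representativeness is one minus the Jensen-Shannon divergence between chain and corpus topic mixes per time window, that consistency is the mean weight of each story's primary topic, and that persistence is the longest run of stories sharing a primary topic. All function names and these definitions are hypothetical.

```python
import math
from itertools import groupby


def mean_dist(dists):
    """Average a list of topic distributions component-wise."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def representativeness(chain, corpus):
    """Assumed definition: mean (1 - JSD) over shared time windows.

    chain, corpus: dicts mapping a time window to a list of
    per-story topic distributions observed in that window.
    """
    sims = [
        1 - js_divergence(mean_dist(chain[w]), mean_dist(corpus[w]))
        for w in chain
        if w in corpus
    ]
    return sum(sims) / len(sims)


def consistency(chain_docs):
    """Assumed definition: mean weight of each story's primary topic."""
    return sum(max(d) for d in chain_docs) / len(chain_docs)


def persistence(chain_docs):
    """Assumed definition: longest run of stories with the same
    primary topic, as a fraction of the chain length."""
    primary = [d.index(max(d)) for d in chain_docs]
    longest = max(len(list(g)) for _, g in groupby(primary))
    return longest / len(chain_docs)
```

Scores like these could serve as input features for a classifier that predicts the four clarity labels, although the abstract does not say which features or which model were actually used.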