Story Forest
Authors: | Linglong Kong, Di Niu, Bang Liu, Yu Xu, Kunfeng Lai, Fred X. Han |
---|---|
Publication year: | 2020 |
Subject: |
General Computer Science
business.industry; Event (computing); Computer science; Timeline; 02 engineering and technology; Document clustering; World Wide Web; User experience design; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Graph (abstract data type); 020201 artificial intelligence & image processing; The Internet; business; Set (psychology); Cluster analysis |
Source: | ACM Transactions on Knowledge Discovery from Data. 14:1-28 |
ISSN: | 1556-472X 1556-4681 |
DOI: | 10.1145/3377939 |
Description: | Extracting events accurately from vast news corpora and organizing them logically is critical for news apps and search engines, which aim to organize news information collected from the Internet and present it to users in the most sensible forms. Intuitively speaking, an event is a group of news documents that report the same news incident, possibly in different ways. In this article, we describe our experience of implementing a news content organization system at Tencent to discover events from vast streams of breaking news and to evolve news story structures in an online fashion. Our real-world system faces unique challenges in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we (1) need to accurately and quickly extract distinguishable events from massive streams of long text documents, and (2) must develop the structures of event stories in an online manner, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. A core novelty of our Story Forest system is EventX, a semi-supervised scheme to extract events from massive Internet news corpora. EventX relies on a two-layered, graph-based clustering procedure to group documents into fine-grained events. We conducted extensive evaluations based on (1) 60 GB of real-world Chinese news data, (2) a large Chinese Internet news dataset that contains 11,748 news articles with ground-truth event labels, and (3) the 20 News Groups English dataset, through detailed pilot user experience studies. The results demonstrate the superior capabilities of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers. |
Database: | OpenAIRE |
External link: |