HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings

Authors: Harald Sack, Russa Biswas, Maria Koutraki, Daniela Kaun, Michael Brunzel, Sven Müller, Tabea Tietz
Publication year: 2019
Subject:
Source: The Semantic Web: ESWC 2019 Satellite Events ISBN: 9783030323264
ESWC (Satellite Events)
Description: Written text can be understood as a means to acquire insights into the nature of past and present cultures and societies. Numerous projects have been devoted to digitizing and publishing historical textual documents in digital libraries, which scientists can use as valuable resources for research. However, the volume of textual data available exceeds humans’ ability to explore it efficiently. In this paper, a framework is presented which combines unsupervised machine learning techniques and natural language processing, using historical text documents about the 19th-century USA as an example. Named entities are extracted from semi-structured text, which is enriched with complementary information from Wikidata. Word embeddings are leveraged to enable further analysis of the text corpus, which is visualized in a web-based application.
Database: OpenAIRE