HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings
Author: | Harald Sack, Russa Biswas, Maria Koutraki, Daniela Kaun, Michael Brunzel, Sven Müller, Tabea Tietz |
---|---|
Year of publication: | 2019 |
Subject: | Text corpus; Computer science; Software engineering; Engineering and technology; Recommender system; Digital library; Visualization; Cultural heritage; Publishing; Electrical engineering, electronic engineering, information engineering; Unsupervised learning; Artificial intelligence & image processing; Artificial intelligence; Natural language processing; Word (computer architecture) |
Source: | The Semantic Web: ESWC 2019 Satellite Events ISBN: 9783030323264 ESWC (Satellite Events) |
Description: | Written text can be understood as a means to acquire insights into the nature of past and present cultures and societies. Numerous projects have been devoted to digitizing and publishing historical textual documents in digital libraries, which scientists can utilize as valuable resources for research. However, the extent of textual data available exceeds humans’ abilities to explore the data efficiently. In this paper, a framework is presented which combines unsupervised machine learning techniques and natural language processing, using the example of historical text documents on the 19th-century USA. Named entities are extracted from semi-structured text, which is enriched with complementary information from Wikidata. Word embeddings are leveraged to enable further analysis of the text corpus, which is visualized in a web-based application. |
Database: | OpenAIRE |