Integrating Data Mining and Data Management Technologies for Scholarly Inquiry

Authors: Paul Watry, Jerome Fuselier, Luis A. Aguilar, Richard Marciano, Shreyas, Ray R. Larson, John Harrison, Chien-Yi Hou
Year of publication: 2014
Subject:
Source: BigData Conference
Description: This short paper discusses the “Integrating Data Mining and Data Management Technologies for Scholarly Inquiry” project. In this “Round Two” Digging Into Data Challenge award, we explored uses of and approaches to large-scale data analysis and processing for the Humanities and Social Sciences through the integration of several infrastructure frameworks: Cheshire, iRODS, and Amazon Web Services (EC2 computing and S3 storage). Our “big data” consisted of the entire texts collection of the Internet Archive (approximately 3.6 million volumes) and the entire JSTOR database. We performed surface-level natural language processing on this data to identify noun phrases, then applied further refinements to identify personal, corporate, and geographic names. We then used resources including library and archival authority records to identify variants and merge names. The goal is to create an integrated index of persons, places, and organizations referenced in our collections.
Database: OpenAIRE
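
The variant-merging step mentioned in the description can be illustrated with a minimal sketch. It is not the project's implementation: the `AUTHORITY` table, the function names, and the sample data are all hypothetical, and a plain Python dictionary stands in for the library and archival authority records. The sketch normalizes each extracted name, looks it up among known variant forms, and groups document identifiers under the preferred heading.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())


# Hypothetical authority records: preferred heading -> known variant forms.
AUTHORITY = {
    "Twain, Mark, 1835-1910": ["Mark Twain", "Samuel Clemens",
                               "Clemens, Samuel Langhorne"],
    "London (England)": ["London", "Londres"],
}


def build_variant_lookup(authority):
    """Map every normalized form (heading or variant) to its heading."""
    lookup = {}
    for heading, variants in authority.items():
        for form in [heading, *variants]:
            lookup[normalize(form)] = heading
    return lookup


def merge_names(extracted, lookup):
    """Group (name, doc_id) pairs under the matching authority heading.

    Names with no authority match are kept under their own surface form.
    """
    index = defaultdict(set)
    for name, doc_id in extracted:
        heading = lookup.get(normalize(name), name)
        index[heading].add(doc_id)
    return dict(index)


# Illustrative input: names as they might come out of an extraction pass.
extracted = [
    ("Mark Twain", "vol1"),
    ("Samuel Clemens", "vol2"),
    ("Londres", "vol2"),
    ("Unknown Person", "vol3"),
]
index = merge_names(extracted, build_variant_lookup(AUTHORITY))
```

Exact matching on normalized strings is the simplest possible choice here; a pipeline operating on millions of volumes would likely need fuzzier matching and disambiguation of people who share a name, which this sketch does not attempt.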