DataSynapse: A Social Data Curation Foundry

Author:	Moshe Chai Barukh, Hamid Reza Motahari-Nezhad, Amin Beheshti, Boualem Benatallah, Reza Nouri, Alireza Tabebordbar
Year of publication:	2018
Subject:	Feature engineering; Information Systems and Management; Data curation; business.industry; Computer science; Big data; 02 engineering and technology; Asset (computer security); Data structure; Data science; Open data; Hardware and Architecture; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Domain knowledge; business; Raw data; Software; Information Systems
Source:	Distributed and Parallel Databases. 37:351-384
ISSN:	1573-7578 0926-8782
DOI:	10.1007/s10619-018-7245-1
Description:	Social data analytics has become a vital asset for organizations and governments. For example, over the last few years, governments have started to extract knowledge and derive insights from vastly growing open data to personalize advertisements in elections, improve government services, predict intelligence activities, and improve national security and public health. A key challenge in analyzing social data is to transform the raw data generated by social actors into curated data, i.e., contextualized data and knowledge that is maintained and made available for use by end-users and applications. To address this challenge, we present the notion of a knowledge lake, i.e., a contextualized Data Lake, which provides the foundation for big data analytics by automatically curating raw social data and preparing it for deriving insights. We present a social data curation foundry, namely DataSynapse, to enable analysts to engage with social data to uncover hidden patterns and generate insight. In DataSynapse, we present a scalable algorithm to transform social items (e.g., a Tweet in Twitter) into semantic items, i.e., contextualized and curated items. This algorithm offers customizable feature extraction to harness desired features from diverse data sources. To link contextualized information items to the domain knowledge, we present a scalable technique that leverages cross-document coreference resolution, assisting analysts in deriving targeted insights. DataSynapse is offered as an extensible and scalable microservice-based architecture that is publicly available on GitHub and supports networks such as Twitter, Facebook, GooglePlus and LinkedIn. We adopt a typical scenario for analyzing urban social issues from Twitter as they relate to the government budget, to highlight how DataSynapse significantly improves the quality of extracted knowledge compared to the classical curation pipeline (in the absence of feature extraction, enrichment and domain-linking contextualization).
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::a8220643ed1e6c5b3c4c54fe3933f866 https://doi.org/10.1007/s10619-018-7245-1