Author: |
Divesh Srivastava, Paolo Merialdo, Paolo Atzeni, Federico Piai |
Contributors: |
Piai, Federico, Atzeni, Paolo, Merialdo, Paolo, Srivastava, Divesh |
Publication year: |
2022 |
Description: |
We focus on semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which exhibit data sparsity and a high degree of heterogeneity, even within each individual source. We assume each source provides a collection of entity specifications, i.e., entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. We take advantage of the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of RaF-STD over alternative approaches adapted from the literature. |
Database: |
OpenAIRE |