Fine-grained semantic type discovery for heterogeneous sources using clustering

Authors: Divesh Srivastava, Paolo Merialdo, Paolo Atzeni, Federico Piai
Contributors: Piai, Federico; Atzeni, Paolo; Merialdo, Paolo; Srivastava, Divesh
Publication year: 2022
Subject:
Description: We focus on semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which exhibit data sparsity and a high degree of heterogeneity, even within each individual source. We assume each source provides a collection of entity specifications, i.e., entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. We take advantage of the opportunities arising from the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of RaF-STD over alternative approaches adapted from the literature.
Database: OpenAIRE
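
The data model in the description (sources holding entity specifications, each a set of attribute name-value pairs) can be illustrated with a small sketch. The code below is not RaF-STD: it is a deliberately naive baseline that groups whole attributes across sources by the Jaccard overlap of their value domains, loosely in the spirit of the "matching of attribute names and domains" mentioned in step (iii). It works at attribute granularity rather than on individual name-value pairs, and the toy data, the function names, and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each source maps to a list of entity specifications,
# and each specification is a dict of attribute name -> value.
SOURCES = {
    "shop_a": [
        {"brand": "canon", "resolution": "20 mp"},
        {"brand": "nikon", "resolution": "24 mp"},
    ],
    "shop_b": [
        {"manufacturer": "canon", "megapixels": "20 mp"},
        {"manufacturer": "nikon", "megapixels": "24 mp", "color": "black"},
    ],
}


def attribute_domains(sources):
    """Collect the set of observed values for every (source, attribute) pair."""
    domains = defaultdict(set)
    for source, specs in sources.items():
        for spec in specs:
            for name, value in spec.items():
                domains[(source, name)].add(value.strip().lower())
    return dict(domains)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_attributes(sources, threshold=0.5):
    """Group attributes from different sources whose value domains overlap.

    Uses a small union-find so that pairwise matches merge transitively.
    The threshold is an arbitrary illustrative choice.
    """
    domains = attribute_domains(sources)
    parent = {attr: attr for attr in domains}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(domains, 2):
        # Only compare attributes that come from different sources.
        if a[0] != b[0] and jaccard(domains[a], domains[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for attr in domains:
        clusters[find(attr)].add(attr)
    return [frozenset(c) for c in clusters.values()]


clusters = cluster_attributes(SOURCES)
```

On this toy input, `brand` and `manufacturer` end up in one cluster and `resolution` and `megapixels` in another, while `color` stays alone. A baseline of this kind breaks down exactly where the description says the setting is hard: sparse data gives small, unreliable domains, and an attribute that mixes several concepts in its values cannot be assigned to a single cluster.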