Effective and efficient retrieval of structured entities
Authors: | Jungho Park, Shaoxu Song, Yunsu Lee, Soo-Hyung Kim, Ruihong Huang, Sungmin Yi |
---|---|
Year of publication: | 2020 |
Subject: | Structure (mathematical logic); Containment (computer programming); Word embedding; Information retrieval; Hierarchy (mathematics); Computer science; General Engineering; Object (computer science); JSON; RDF; XML; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020204 information systems; 020201 artificial intelligence & image processing |
Source: | Proceedings of the VLDB Endowment. 13:826-839 |
ISSN: | 2150-8097 |
Description: | Structured entities are commonly abstracted from sources such as XML, RDF, or hidden-web databases. Direct retrieval of various structured entities is in high demand in data lakes, e.g., given a JSON object, finding the XML entities that denote the same real-world object. Existing approaches to evaluating structured entity similarity place too much emphasis on structural inconsistency. Indeed, entities from heterogeneous sources can have very distinct structures, owing to various information representation conventions. We argue that retrieval could be more tolerant of structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of the proper entities (parents or children are returned instead). To solve the problem, a novel hierarchy smooth function is proposed to combine the term scores in different nodes of a structured entity. Entities sharing the same structure, together called an entity family, are used to learn the coefficient for aggregating the scores, and thus to distinguish and prune the parent or child entities. Remarkably, the proposed method can work with both bag-of-words (BOW) and word embedding models, which have been successful in retrieving unstructured documents, to query structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient. |
Database: | OpenAIRE |
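
The abstract describes combining term scores from different nodes of a structured entity with a learned coefficient, but this record does not give the formula. The sketch below is only an illustration of that general idea, not the paper's hierarchy smooth function: it assumes a simple bag-of-words count per node and a fixed, hypothetical decay coefficient `alpha` applied once per level of nesting. The names `tokenize`, `node_terms`, `smoothed_score`, and `alpha` are all assumptions introduced here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def node_terms(node):
    """Count tokens in the scalar values stored directly under this node."""
    counts = Counter()
    for value in node.values():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, dict):
                counts.update(tokenize(str(item)))
    return counts


def smoothed_score(node, query_terms, alpha=0.5):
    """Score a nested entity: own term matches plus alpha times each child's score.

    alpha is a hypothetical fixed coefficient; the paper learns its
    coefficient from entity families, which is not reproduced here.
    """
    own = node_terms(node)
    score = float(sum(own[t] for t in query_terms))
    for value in node.values():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                score += alpha * smoothed_score(child, query_terms, alpha)
    return score


# A JSON-like entity with two levels of containment (illustrative data only).
paper = {
    "title": "structured entity retrieval",
    "venue": {"name": "VLDB Endowment", "publisher": {"name": "VLDB"}},
}
query = tokenize("vldb retrieval")
score = smoothed_score(paper, query, alpha=0.5)
```

With `alpha` below 1, matches found deep inside a child contribute less than matches at the top level, which is one plausible way to keep a parent entity from outranking the child that actually answers the query. Setting `alpha` to 0 ignores nested content entirely.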