On the potential of domain literature for clustering and Bayesian network learning
Authors: | Geert Fannes, Patrick Glenisson, Péter Antal |
---|---|
Subject: |
Artificial neural network; Computer science; Contrast (statistics); Bayesian network; Statistical model; Machine learning; Domain (software engineering); Text mining; Knowledge extraction; Similarity (psychology); Artificial intelligence; Cluster analysis |
Source: | Scopus-Elsevier KDD |
Description: | Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models for which data are scarce or noisy. This raises the question of how to integrate information from domain literature with statistical data. Because quantifying similarities or dependencies between variables is a basic building block in knowledge discovery, we consider the following question: which vector representations of text, and which statistical scores of similarity or dependency, best support the use of literature in statistical models? For the text source, we assume that the domain variables are annotated with short free-text descriptions and, optionally, that a large literature repository is available from which these annotations can be expanded. For evaluation, we contrast the variable similarities or dependencies obtained from text, using different annotation sources and vector representations, with those obtained from measurement data or expert assessments. Specifically, we consider two learning problems: clustering and Bayesian network learning. First, we report performance (against an expert reference) for clustering yeast genes from textual annotations. Second, we assess the agreement between text-based and data-based scores of variable dependencies when learning Bayesian network substructures for the task of modeling the joint distribution of clinical measurements of ovarian tumors. |
Database: | OpenAIRE |
External link: |
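
The description refers to vector representations of text and similarity scores between annotated variables without naming specific ones. As a rough illustration only, the sketch below assumes a bag-of-words representation with smoothed TF-IDF weights and cosine similarity as the score. These choices, the helper names (`tokenize`, `tfidf_vectors`, `cosine`), and the three toy annotations are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(annotations):
    """Map each variable name to a sparse TF-IDF vector (dict: term -> weight).

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, chosen only
    for illustration.
    """
    tokenized = {name: tokenize(text) for name, text in annotations.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per annotation
    vectors = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical free-text annotations for three variables (toy data).
annotations = {
    "gene_a": "ribosomal protein involved in translation",
    "gene_b": "ribosomal protein of the large subunit translation",
    "gene_c": "membrane transporter for glucose uptake",
}

vectors = tfidf_vectors(annotations)
names = sorted(vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(cosine(vectors[first], vectors[second]), 3))
```

In this toy setup, `gene_a` and `gene_b` share several terms and so score higher than either does against `gene_c`. A pairwise similarity matrix of this kind could then be passed to a clustering routine or compared against data-based dependency scores, which is the type of contrast the description outlines.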