The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

Authors: Marta Iannuccelli, Alfonso Valencia, Zhiyong Lu, Andrew G. Winter, Charles Elkan, Shashank Agarwal, Ashish V. Tendulkar, Martin Krallinger, Robert Leaman, Rafal Rak, Florian Leitner, Keith Noto, Feifan Liu, Hagit Shatkay, Gerold Schneider, Sun Kim, Graciela Gonzalez, Miguel Vazquez, W. John Wilbur, Sérgio Matos, Rezarta Islamaj Dogan, Xinglong Wang, Livia Perfetto, Luis M. Rocha, David Salgado, Miguel A. Andrade-Navarro, Luisa Castagnoli, Fabio Rinaldi, Leonardo Briganti, Gianni Cesareni, Mike Tyers, Luana Licata, Jean-Fred Fontaine, Andrew Chatr-aryamontri
Contributors: Génétique Médicale et Génomique Fonctionnelle (GMGF), Aix Marseille Université (AMU)-Assistance Publique - Hôpitaux de Marseille (APHM)- Hôpital de la Timone [CHU - APHM] (TIMONE)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS), Université de Montréal (UdeM), Università degli Studi di Roma Tor Vergata [Roma], Mount Sinai Hospital [Toronto, Canada] (MSH), Institute of Computational Linguistics, Universität Zürich [Zürich] = University of Zurich (UZH), Department of Electronics, Telecommunications and Informatics [Aveiro] (DETI), Universidade de Aveiro, Department of Pathology, Case Western Reserve University [Cleveland], Sorenson Molecular Genealogy Foundation, National Center for Biotechnology Information (NCBI), Mitochondrie : Régulations et Pathologie, Université d'Angers (UA)-Institut National de la Santé et de la Recherche Médicale (INSERM), National Cancer Institute [Bethesda] (NCI-NIH), National Institutes of Health [Bethesda] (NIH), Institut National de la Santé et de la Recherche Médicale (INSERM)-Aix Marseille Université (AMU)-Assistance Publique - Hôpitaux de Marseille (APHM)- Hôpital de la Timone [CHU - APHM] (TIMONE)-Centre National de la Recherche Scientifique (CNRS)
Year of publication: 2011
Subject:
Periodicals as Topic
Animals
Data Mining
Humans
Databases, Protein
Algorithms
Proteins
PubMed
Computer science
Ontology (information science)
computer.software_genre
Biochemistry
Task (project management)
03 medical and health sciences
Text mining
Structural Biology
Molecular Biology
ComputingMilieux_MISCELLANEOUS
030304 developmental biology
0303 health sciences
business.industry
Applied Mathematics
Document classification
Research
030302 biochemistry & molecular biology
Biomedical text mining
Computer Science Applications
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
Settore BIO/18 - Genetica
Ranking
Test set
Ontology
Data mining
Artificial intelligence
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
business
computer
Natural language processing
Source: BMC Bioinformatics
BMC Bioinformatics, BioMed Central, 2011, 12 (Suppl 8), ⟨10.1186/1471-2105-12-S8-S3⟩
BMC Bioinformatics; Vol 12
ISSN: 1471-2105
Description:
Background: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, and performance measures that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations. They tried to address how the end user would oversee the generated output, for instance by providing ranked results and textual evidence for human interpretation, or by measuring the time saved by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. To this end, the BCIII-ACT corpus was provided. It comprises training, development and test sets totalling over 12,000 PPI-relevant and non-relevant PubMed abstracts, labeled manually by domain experts, and also records the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full-text articles and the interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
Results: A total of 11 teams participated in at least one of the two PPI tasks (10 in the ACT and 8 in the IMT), and a total of 62 persons were involved either as participants or in preparing data sets and evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. Among the 52 runs submitted for the ACT, the highest Matthews Correlation Coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources such as MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manual annotations generated by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%.
Conclusions: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time needed with unranked results. Detecting associations between full-text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and differing granularity of the method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
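Note on the metrics: the abstract reports both an MCC and an accuracy for the ACT, and on an unbalanced collection the two can diverge considerably. The following is a minimal sketch of how each is computed from confusion-matrix counts. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the BioCreative III evaluation.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is taken as 0 when any marginal sum is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical unbalanced collection: 150 relevant and 850 non-relevant abstracts.
classifier = dict(tp=100, fp=50, tn=800, fn=50)
# Trivial baseline that labels every abstract as non-relevant.
majority = dict(tp=0, fp=0, tn=850, fn=150)

for name, counts in (("classifier", classifier), ("majority baseline", majority)):
    print(f"{name}: accuracy={accuracy(**counts):.2f}, MCC={mcc(**counts):.3f}")
```

With these invented counts the baseline that rejects every abstract still reaches an accuracy of 0.85 while its MCC is 0, which suggests why a correlation-based score is more informative than accuracy alone when the classes are unbalanced.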
Database: OpenAIRE