Efficient, robust and effective rank aggregation for massive biological datasets

Authors:	Alain Denise, Laurent Bulteau, Bryan Brancotte, Pierre Andrieu, Adeline Pierrot, Stéphane Vialette, Sarah Cohen-Boulakia
Contributors:	Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Hub Bioinformatique et Biostatistique - Bioinformatics and Biostatistics HUB, Institut Pasteur [Paris] (IP)-Université Paris Cité (UPCité), Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS). This work has been partly supported by a CNRS Défi Mastodons grant.
Language:	English
Publication year:	2021
Subject:	[INFO.INFO-CC]Computer Science [cs]/Computational Complexity [cs.CC]; Computer Networks and Communications; Computer science; [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]; Context (language use); 02 engineering and technology; computer.software_genre; Set (abstract data type); Consensus ranking; 0202 electrical engineering electronic engineering information engineering; Rank (computer programming); Large numbers; Approximation algorithm; 020206 networking & telecommunications; Exact algorithm; Ranking; Hardware and Architecture; Rank aggregation; Massive biological datasets; 020201 artificial intelligence & image processing; Data mining; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Heuristics; Kemeny rule; computer; Software
Source:	Future Generation Computer Systems, Elsevier, 2021, 124, pp. 406-421. ⟨10.1016/j.future.2021.06.013⟩
ISSN:	0167-739X
DOI:	10.1016/j.future.2021.06.013
Description:	International audience; Massive biological datasets are available in various sources. To answer a biological question (e.g., "which are the genes involved in a given disease?"), life scientists query and mine such datasets using various techniques. Each technique provides a list of results, usually ranked by importance (e.g., a list of ranked genes). Combining the results obtained by various techniques, that is, combining ranked lists of elements into one list of elements, is of paramount importance to help life scientists make the most of various results and prioritize further investigations. Rank aggregation techniques are particularly well suited to this context as they take in a set of rankings and provide a consensus, that is, a single ranking which is the "closest" to the input rankings. However, (i) the problem of rank aggregation is NP-hard in most cases (using an exact algorithm is currently not possible for more than a few dozen elements) and (ii) several (possibly very different) exact solutions can be obtained. In answer to (i), many heuristics and approximation algorithms have been proposed. However, heuristics cannot guarantee how far from an exact solution the consensus ranking will be, and the approximation ratio of approximation algorithms dedicated to the problem is fairly high (not less than 3/2). No solution has yet been proposed to help end users deal with the problem encountered in point (ii). In this paper we present a complete system able to perform rank aggregation of massive biological datasets. Our solution is efficient as it is based on an original partitioning method making it possible to compute a high-quality consensus using an exact algorithm in a large number of cases. Our solution is robust as it clearly identifies, for the user, groups of elements whose relative order is the same in every optimal solution. These features provide answers to points (i) and (ii) and rest on mathematical foundations offering guarantees on the computed result. Also, our solution is effective as it has been implemented in a real tool, ConquR-BioV2, used by the life science community, and evaluated at large scale using a very large number of datasets.
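As a hedged illustration of the consensus notion used in the description, the following minimal Python sketch computes, by brute force, every optimal consensus under the Kemeny rule (a ranking minimizing the summed Kendall-tau distance to the input rankings) and then lists the ordered pairs that agree across all optima. It assumes strict rankings without ties over the same elements; the function names and the toy data are illustrative assumptions, not taken from the paper or from the tool, and the paper's partitioning method is not reproduced here.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count the element pairs that two strict rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )


def kemeny_score(candidate, rankings):
    """Sum of Kendall-tau distances from a candidate to all input rankings."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def optimal_consensuses(rankings):
    """Enumerate all permutations (factorial time, toy sizes only)
    and keep every candidate that reaches the minimum score."""
    elements = sorted(rankings[0])
    best_score, best = None, []
    for cand in permutations(elements):
        score = kemeny_score(cand, rankings)
        if best_score is None or score < best_score:
            best_score, best = score, [cand]
        elif score == best_score:
            best.append(cand)
    return best_score, best


def stable_pairs(optima):
    """Ordered pairs (a, b) with a ranked before b in every optimum."""
    elements = sorted(optima[0])
    return {
        (a, b)
        for a, b in permutations(elements, 2)
        if all(o.index(a) < o.index(b) for o in optima)
    }


# Toy input (illustrative only): two rankings that disagree on A versus B.
rankings = [("A", "B", "C", "D"), ("B", "A", "C", "D")]
score, optima = optimal_consensuses(rankings)
print(score)                         # 1
print(sorted(optima))                # the two input rankings tie as optima
print(sorted(stable_pairs(optima)))  # every pair except A versus B
```

On this toy input two optimal consensuses exist, which corresponds to point (ii) of the description, while the relative order of every pair other than A and B is identical in both. The exhaustive search is only feasible for a handful of elements, consistent with the NP-hardness noted in point (i).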
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::14f8ab4123bfded52a7169925afab50d https://hal.science/hal-03388443/document