Nonlinear projection methods for visualizing Barcode data and application on two data sets

Authors: Madalina Olteanu, Violaine Nicolas, Catherine Laredo, Jan Kennis, Christiane Denys, Alain-Didier Missoup, Brigitte Schaeffer
Contributors: Statistique, Analyse et Modélisation Multidisciplinaire (SAmos-Marin Mersenne) (SAMM), Université Paris 1 Panthéon-Sorbonne (UP1), Chercheur indépendant, Origine, structure et évolution de la biodiversité (OSEB), Muséum national d'Histoire naturelle (MNHN)-Centre National de la Recherche Scientifique (CNRS), Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle (MNHN)-École pratique des hautes études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Centre National de la Recherche Scientifique (CNRS), Unité de recherche Mathématiques et Informatique Appliquées (MIA), Institut National de la Recherche Agronomique (INRA), Laboratoire de Probabilités et Modèles Aléatoires (LPMA), Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7)-Université Pierre et Marie Curie - Paris 6 (UPMC), Université Panthéon-Sorbonne (UP1), Biologie Intégrative des Populations, École pratique des hautes études (EPHE)-Centre National de la Recherche Scientifique (CNRS), Université Pierre et Marie Curie - Paris 6 (UPMC)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS), Muséum national d'Histoire naturelle (MNHN)-Université Pierre et Marie Curie - Paris 6 (UPMC)-École Pratique des Hautes Études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)
Language: English
Publication year: 2013
Subject:
0106 biological sciences
multidimensional scaling
[SDV]Life Sciences [q-bio]
Molecular Sequence Data
Context (language use)
DNA sequences
[SDV.BC]Life Sciences [q-bio]/Cellular Biology
Biology
Moths
Bioinformatics
Barcode
010603 evolutionary biology
01 natural sciences
03 medical and health sciences
Data visualization
Species Specificity
Genetics
Computer Graphics
Animals
Cluster Analysis
DNA Barcoding, Taxonomic
Multidimensional scaling
Cluster analysis
Projection (set theory)
Ecology, Evolution, Behavior and Systematics
Phylogeny
unsupervised algorithms
visualization
030304 developmental biology
0303 health sciences
[STAT.AP]Statistics [stat]/Applications [stat.AP]
Pattern recognition
[MATH.MATH-PR]Mathematics [math]/Probability [math.PR]
Chemistry
ComputingMethodologies_PATTERNRECOGNITION
Nonlinear Dynamics
dissimilarity matrices
Unsupervised learning
Artificial intelligence
Murinae
median self-organizing maps
Biotechnology
Curse of dimensionality
Source: Molecular Ecology Resources
Molecular Ecology Resources, Wiley/Blackwell, 2013, 13 (6), pp.976-990. ⟨10.1111/1755-0998.12047⟩
ISSN: 1755-098X
1755-0998
DOI: 10.1111/1755-0998.12047
Description: Developing tools for visualizing DNA sequences is an important issue in the Barcoding context. Visualizing Barcode data can be framed as a purely statistical problem of unsupervised learning. Clustering methods combined with projection methods have two closely linked objectives: visualizing the data and finding structure in it. Multidimensional scaling (MDS) and self-organizing maps (SOM) are unsupervised statistical tools for data visualization. Both algorithms map data onto a lower-dimensional manifold: MDS looks for a projection that best preserves pairwise distances, while SOM preserves the topology of the data. Both algorithms were initially developed for Euclidean data, and the conditions required for them to work well are not satisfied by Barcode data. We developed a workflow consisting of four steps: collapse the data into distinct sequences; compute a dissimilarity matrix; run a modified version of SOM for dissimilarity matrices to structure the data and reduce dimensionality; project the results using MDS. This methodology was applied to Astraptes fulgerator and to Hylomyscus, an African rodent with debated taxonomy. We obtained very good results for both data sets. The results were robust to species being unevenly represented in the data. All the species in Astraptes were displayed as clearly distinct groups in the various visualizations, except for LOHAMP and FABOV, which were mixed together. For Hylomyscus, our findings were consistent with known species, confirmed the existence of four unnamed taxa and suggested the existence of potentially new species.
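The four-step workflow named in the description (collapse, dissimilarity matrix, SOM for dissimilarity data, MDS projection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy sequences, the p-distance (proportion of differing sites), the one-dimensional SOM grid, the Gaussian neighbourhood and all function names are assumptions chosen only to keep the example short. The record lists "median self-organizing maps" among its keywords, so the sketch uses a simplified median SOM in which every prototype is one of the observed sequences.

```python
import numpy as np

def collapse(seqs):
    """Step 1: keep distinct sequences and how often each occurs."""
    uniq, counts = np.unique(np.array(seqs), return_counts=True)
    return list(uniq), counts

def p_distance(seqs):
    """Step 2 (assumed metric): proportion of differing sites, aligned sequences."""
    arr = np.array([list(s) for s in seqs])
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=2)

def median_som(D, weights, n_units=2, n_iter=10, seed=0):
    """Step 3: simplified median SOM on a 1-D grid (hypothetical helper).

    Prototypes are indices of observations, so only D is needed.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    protos = rng.choice(n, size=n_units, replace=False)
    grid = np.arange(n_units)
    for t in range(n_iter):
        # Neighbourhood radius shrinks over the iterations.
        radius = max(n_units / 2 * (1 - t / n_iter), 0.5)
        h = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * radius ** 2))
        bmu = D[:, protos].argmin(axis=1)  # best-matching unit per observation
        # cost[u, c]: weighted dissimilarity of candidate c to data near unit u
        cost = (h[:, bmu] * weights) @ D
        protos = cost.argmin(axis=1)
    bmu = D[:, protos].argmin(axis=1)
    return protos, bmu

def classical_mds(D, k=2):
    """Step 4: classical (Torgerson) MDS from a dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# Toy aligned sequences (invented for illustration only).
seqs = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TTGTACGA", "TTGTCCGA", "TTGTCCGA"]
uniq, counts = collapse(seqs)
D = p_distance(uniq)
protos, bmu = median_som(D, counts, n_units=2)
coords = classical_mds(D[np.ix_(protos, protos)], k=2)
```

Here `coords` holds one 2-D point per SOM prototype and `bmu` assigns each distinct sequence to a prototype. Using the sequence counts as weights is one simple way to let collapsed duplicates still influence the map; a real analysis would likely use an evolutionary distance and a two-dimensional grid.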
Database: OpenAIRE