GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Authors: Markus Ollert, Oliver Hunewald, Reinhard Schneider, Laurent Heirendt, Venkata P. Satagopam, Vasco Verissimo, Jiří Vondrášek, Christophe Trefois, Miroslav Kratochvíl
Contributors: ELIXIR CZ LM2018131 (MEYS) [sponsor], FNR AFR-RIKEN bilateral program (TregBar 2015/11228353) [sponsor], FNR PRIDE Doctoral Training Unit program (PRIDE/11012546/NEXTIMMUNE) [sponsor], Institute of Organic Chemistry and Biochemistry of the CAS (RVO: 61388963) [sponsor], ELIXIR Staff Exchange programme 2020 [sponsor], University of Luxembourg: High Performance Computing - ULHPC [research center], Luxembourg Centre for Systems Biomedicine (LCSB): Bioinformatics Core (R. Schneider Group) [research center]
Publication year: 2020
Subjects:
Self-organizing map
single-cell cytometry
Speedup
Computer science
AcademicSubjects/SCI02254
Health Informatics
self-organizing maps
Multidisciplinary, general & others [F99] [Life sciences]
computer.software_genre
03 medical and health sciences
Mice
0302 clinical medicine
Software
Technical Note
Animals
Cluster Analysis
Cluster analysis
030304 developmental biology
dimensionality reduction
0303 health sciences
business.industry
Dimensionality reduction
Multidisciplinary, general & others [C99] [Engineering, computing & technology]
Process (computing)
Julia
high-performance computing
Supercomputer
Computer Science Applications
Visualization
Data point
Scalability
Key (cryptography)
AcademicSubjects/SCI00960
Programming Languages
Data mining
business
computer
Algorithms
030215 immunology
clustering
Source: GigaScience
Kratochvíl, M, Hunewald, O, Heirendt, L, Verissimo, V, Vondrášek, J, Satagopam, V P, Schneider, R, Trefois, C & Ollert, M 2020, 'GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets', GigaScience, vol. 9, no. 11, giaa127. https://doi.org/10.1093/gigascience/giaa127
ISSN: 2047-217X
DOI: 10.1093/gigascience/giaa127
Description:
Background: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances make it easy to generate datasets with hundreds of millions of single-cell data points and more than 40 parameters, originating from thousands of individual samples. Analyzing that much high-dimensional data is demanding on both the hardware and the software of high-performance computing resources. Current software tools often do not scale to datasets of this size; users are thus forced to down-sample the data to manageable sizes, losing accuracy and the ability to detect many underlying complex phenomena.
Results: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.
Conclusions: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.
Key points: GigaSOM.jl improves the applicability of FlowSOM-style single-cell cytometry data analysis by increasing the acceptable dataset size to billions of single cells. Significant speedup over current methods is achieved by distributed processing and the use of efficient algorithms. The GigaSOM.jl package includes support for fast visualization of multidimensional data.
Database: OpenAIRE