Description: |
MDAnalysis (http://mdanalysis.org) is a Python library to analyze molecular dynamics (MD) trajectories generated by all major MD simulation packages. MDAnalysis enables users to access the raw simulation data through a uniform object-oriented Python interface and to perform structural and temporal analysis of their simulations. Simulations are continuously increasing in size and length; the amount of data to be analyzed is therefore growing rapidly, and analysis is increasingly becoming a bottleneck. Parallel approaches are needed to increase analysis throughput, but MDAnalysis does not yet provide a standard interface for parallel analysis; instead, various existing parallel libraries are currently used to parallelize MDAnalysis-based code. In this work, we describe a benchmark suite that can be used to evaluate performance for parallel map-reduce type analysis and use it to investigate the performance of MDAnalysis with the Dask library for task-graph based distributed computing (http://dask.pydata.org/). As the computational task we perform an optimal structural superposition of the atoms of a protein to a reference structure by minimizing the root mean square distance (RMSD) of the Cα atoms. A range of commonly used MD file formats (CHARMM/NAMD DCD, Gromacs XTC, Amber NetCDF) and different trajectory sizes are benchmarked on different high performance computing (HPC) resources, ranging from XSEDE supercomputers with SSD or Lustre storage to local heterogeneous workstations with a Gigabit-linked network file system or locally attached SSDs. The benchmarks show a strong dependence of the overall execution time on the file format and the hardware. DCD is the fastest format to read but only scales to moderate core numbers when the files are served from SSDs; in general, contention of parallel workers for the file prevents scaling on most hardware. XTC appears to be the most balanced format overall, with consistently strong scaling and efficiency across most resource configurations. Parallelization within a node (up to 24 processes) with the dask multiprocessing scheduler is generally beneficial, but parallelization across multiple nodes (with dask distributed) shows only weak gains, likely due to network contention that slows down individual tasks and leads to overall waits and poor load balancing. Overall, obtaining good parallel performance with a map-reduce approach for trajectory analysis depends strongly on the efficient transfer of trajectory data into memory, and this work provides guidelines for choosing a trajectory format that yields good performance on the given hardware.
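
As a concrete illustration of the map-reduce scheme benchmarked here, the following minimal Python sketch splits a trajectory into contiguous blocks of frames, computes the per-frame Cα RMSD after optimal superposition within each block as an independent dask task, and concatenates the per-block results. This is a sketch under stated assumptions, not the exact benchmark code: the file names "top.psf" and "traj.dcd", the number of blocks, and the helper block_rmsd are illustrative placeholders.

import numpy as np
import dask
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

TOP, TRAJ = "top.psf", "traj.dcd"   # hypothetical topology/trajectory files

def block_rmsd(start, stop):
    # Each task opens its own Universe so workers do not share file handles.
    u = mda.Universe(TOP, TRAJ)
    ca = u.select_atoms("name CA")
    u.trajectory[0]
    ref = ca.positions.copy()        # reference: Cα coordinates of frame 0
    out = []
    for ts in u.trajectory[start:stop]:
        # optimal superposition to the reference before computing the RMSD
        out.append(rmsd(ca.positions, ref, center=True, superposition=True))
    return np.array(out)

u = mda.Universe(TOP, TRAJ)
n_frames = len(u.trajectory)
n_blocks = 4                         # e.g. one block per worker process
bounds = np.linspace(0, n_frames, n_blocks + 1, dtype=int)

tasks = [dask.delayed(block_rmsd)(b, e)
         for b, e in zip(bounds[:-1], bounds[1:])]
# "processes" selects the dask multiprocessing scheduler (single node);
# submitting through a dask.distributed client corresponds to the
# multi-node configurations discussed above.
results = dask.compute(*tasks, scheduler="processes")
rmsd_series = np.concatenate(results)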