Parallel processing with treeClust

Author:	McKechnie, I. Taylor
Contributors:	Buttrey, Samuel E.; Whitaker, Lyn R.; Operations Research (OR)
Year of publication:	2017
Subject:	tree clusters; high performance computing; parallel processing; information systems technology; big data sets; batch scripting
Description:	Clustering data is one of the most common statistical and machine learning techniques for analyzing big data. Clustering can be particularly difficult when the data sets include categorical, missing, or noise variables. The tree clustering algorithm developed by Samuel Buttrey and Lyn Whitaker, as described in the December 2015 issue of The R Journal, seems to provide a solution to these problems, but it requires a large set of overhead computations. This issue is intensified when working with high-dimensional data because the extent of treeClust’s overhead computations is based on the dimensions of the data. High performance computing (HPC) and parallel processing present a solution to this overhead computation burden, but treeClust’s existing parallel processing method does not work on the Naval Postgraduate School’s HPC system, the Hamming Supercomputer (HSC). Furthermore, correctly determining which HPC resources to request can be a difficult task. In this thesis, we present a new HSC-specific method for parallel processing of data using the treeClust R package developed by Buttrey and Whitaker. Based on the results of our experiments, our method approximates the optimal HPC resource request, so that users realize the best run time when using treeClust on the HSC. http://archive.org/details/parallelprocessi1094556158. Captain, United States Marine Corps. Approved for public release; distribution is unlimited.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od______2778::549671a11e07dbc1247cb8a6206277b0 https://hdl.handle.net/10945/56158