Distributed analysis of vertically partitioned sensor measurements under communication constraints

Author: Stolpe, Marco
Contributors: Morik, Katharina, Rehof, Jakob, Brefeld, Ulf
Language: English
Year of publication: 2017
Subject:
DOI: 10.17877/de290r-17839
Description: Nowadays, large amounts of data are automatically generated by devices and sensors. They measure, for instance, parameters of production processes, environmental conditions of transported goods, energy consumption of smart homes, traffic volume, air pollution and water consumption, or pulse and blood pressure of individuals. The collection and transmission of data is enabled by electronics, software, sensors and network connectivity embedded into physical objects. These objects, and the infrastructure connecting them, are called the Internet of Things (IoT). In 2010, already 12.5 billion devices were connected to the IoT, a number about twice as large as the world's population at that time. The IoT provides us with data about our physical environment at a level of detail never known before in human history. Understanding such data creates opportunities to improve our ways of living, learning, working, and entertaining ourselves. For instance, the information obtained from data analysis modules embedded into existing processes could help to optimize them, leading to more sustainable systems which save resources in sectors such as manufacturing, logistics, energy and utilities, the public sector, or healthcare.

The IoT's inherently distributed nature, the resource constraints and dynamism of its networked participants, as well as the amounts and diverse types of data collected, challenge even the most advanced automated data analysis methods known today. Currently, there is a strong research focus on centralizing all data in the cloud and processing it according to the paradigm of parallel high-performance computing. However, the resources of devices and sensors at the data-generating side might not suffice to transmit all data. For instance, pervasive distributed systems such as wireless sensor networks are highly communication-constrained, as are high-throughput streaming applications, or those where the data masses are simply too large to be sent over the existing communication lines, such as satellite connections. Hence, the IoT requires a new generation of distributed algorithms which are resource-aware and intelligently reduce the amount of data transmitted and processed throughout the analysis chain.

This thesis deals with the distributed analysis of vertically partitioned sensor measurements under communication constraints, a particularly challenging scenario. Here, it is not the observations that are distributed over the nodes, but their feature values. Learning accurate prediction models may require combining information from different nodes, which necessarily leads to communication. The main question is how to design communication-efficient algorithms for this scenario while preserving sufficient accuracy.

The first part of the thesis introduces fundamental concepts. An overview of the IoT and its many applications is given, with a special focus on data analysis, the vertically partitioned data scenario, and the accompanying research questions. Then, basic notions of machine learning and data mining are introduced, and a selection of existing distributed data mining approaches is presented and discussed in more detail. Distributed learning in the vertically partitioned data scenario is then motivated by a smart manufacturing case study: in a hot rolling mill, different machines measure parameters describing the processing of individual steel blocks, whose quality should be predicted as early as possible by analyzing the distributed measurements.
Each machine produces not a single series of values, but many of them. Their heterogeneity raises challenging questions concerning preprocessing and finding a good representation for learning, for which solutions are proposed. Another problem is that quality information is given not for individual blocks, but for charges (batches) of blocks. How can the quality of individual blocks nevertheless be predicted? Time constraints lead to questions typical of the vertically partitioned data scenario: which data should be analyzed locally, to meet the constraints, and which should be sent to a central server?

Learning from aggregated label information is a relatively novel problem in machine learning research. A new algorithm for the task is developed and evaluated: the Learning from Label Proportions by Clustering (LLPC) algorithm. Its performance is compared to three other state-of-the-art approaches in terms of accuracy and running time. LLPC achieves results with lower running time, while its accuracy is comparable to, or significantly higher than, that of its competitors. The algorithm offers further benefits, such as ease of implementation and a small memory footprint.

For highly decentralized systems, the Training of Local Models from (Label) Counts (TLMC) algorithm is proposed. The method builds on LLPC, reducing communication by transferring only label counts for batches of observations between nodes. The feasibility of the approach is demonstrated by evaluating the algorithm's performance in the context of traffic flow prediction. TLMC is shown to be much more communication-efficient than centralizing all data, while its accuracy can nevertheless compete with that of a centrally trained global model.

Finally, a communication-efficient distributed algorithm for anomaly detection is proposed: the Vertically Distributed Core Vector Machine (VDCVM). During learning, it communicates up to an order of magnitude less data than another state-of-the-art approach, or than training a global model by centralizing all data. Nevertheless, in many relevant cases, the VDCVM achieves similar or even higher accuracy on several controlled and benchmark datasets.

A main result of the thesis is that communication-efficient learning is possible in cases where the features at different nodes are conditionally independent given the target value to be predicted. Most efficient are local models that exchange label information between nodes. In contrast to consensus algorithms, which transmit labels repeatedly, TLMC sends labels only once between nodes. Communication can be reduced even further by learning from counts of labels. In the context of traffic flow prediction, the accuracy achieved is still sufficient in comparison to centralizing all data and training a global model. In the case of anomaly detection, similar results are achieved by a sampling approach which draws only as many observations as needed to reach a (1+ε)-approximation of the minimum enclosing ball (MEB).

The developed approaches have many applications in communication-constrained settings in the sectors mentioned above. It has been shown that data can be reduced and learned from before it even enters the cloud. Decentralized processing might thus enable the analysis of big data masses as more and more devices are connected to the IoT.
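To make the vertically partitioned scenario concrete, here is a minimal sketch (not taken from the thesis; the toy array and the two-node split are illustrative assumptions) contrasting it with the more common horizontal partitioning:

```python
import numpy as np

# Toy dataset: 6 observations (rows) with 4 features (columns).
X = np.arange(24).reshape(6, 4)

# Horizontal partitioning: each node holds complete observations.
node_a_rows, node_b_rows = X[:3, :], X[3:, :]   # rows 0-2 vs. rows 3-5

# Vertical partitioning: each node observes ALL items, but only a
# subset of their features (e.g., one sensor or machine per node).
node_a_cols, node_b_cols = X[:, :2], X[:, 2:]   # features 0-1 vs. 2-3

# Training a joint model may require matching up feature values of the
# *same* observation across nodes -- the source of communication cost.
assert np.array_equal(np.hstack([node_a_cols, node_b_cols]), X)
```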
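The exact LLPC algorithm is defined in the thesis; the following is only a generic sketch of the underlying learning-from-label-proportions idea it builds on: cluster the unlabeled data, then choose the cluster labeling whose implied per-bag label proportions best match the given ones. The function name, the use of k-means, and the exhaustive search over labelings are illustrative simplifications:

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def label_proportions_by_clustering(X, bags, bag_props,
                                    n_clusters=4, n_labels=2):
    """Sketch of learning from label proportions via clustering (not the
    thesis's LLPC itself). bags: list of index arrays, one per bag;
    bag_props: per-bag label proportion vectors of length n_labels."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    best, best_err = None, np.inf
    # Exhaustive search over cluster->label maps (small n_clusters only).
    for assign in itertools.product(range(n_labels), repeat=n_clusters):
        labels = np.asarray(assign)[clusters]
        err = sum(
            np.sum((np.bincount(labels[bag], minlength=n_labels) / len(bag)
                    - props) ** 2)
            for bag, props in zip(bags, bag_props))
        if err < best_err:
            best, best_err = labels, err
    return best  # estimated per-observation labels
```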
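The communication saving behind TLMC's label-count transfer can be illustrated with a back-of-the-envelope comparison; the batch size, feature count, and byte accounting below are assumptions for illustration, not the thesis's protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(256, 32))           # 256 observations, 32 features
local_labels = rng.integers(0, 2, size=256)  # labels predicted locally

raw_bytes = batch.nbytes                     # cost of centralizing the batch
counts = np.bincount(local_labels, minlength=2)
count_bytes = counts.nbytes                  # cost of sending label counts only

print(raw_bytes, count_bytes)                # 65536 vs. 16 bytes per batch
```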
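The (1+ε)-approximation of the minimum enclosing ball mentioned above is the primitive underlying Core Vector Machines. Below is a minimal sketch of the well-known Bădoiu–Clarkson iteration (a generic textbook version, not the VDCVM's distributed variant; the function name is hypothetical):

```python
import numpy as np

def approx_meb(X, eps=0.1):
    """Badoiu-Clarkson iteration: repeatedly move the center toward the
    farthest point; O(1/eps^2) steps yield a (1+eps)-approximate MEB."""
    c = X[0].astype(float)
    for i in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        farthest = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c += (farthest - c) / (i + 1)   # step size shrinks over iterations
    radius = np.linalg.norm(X - c, axis=1).max()
    return c, radius
```

Because the number of iterations depends only on ε, the number of observations that must be sampled, and hence communicated, during learning is bounded independently of the dataset size.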
Database: OpenAIRE