Description: |
Large volumes of data have become commonplace in many domains. Machine learning algorithms can be trained to find useful hidden patterns in such data. Sometimes such data must be summarized, for example as histograms, to reduce them to a manageable size. Traditionally, machine learning algorithms are trained on data expressed as real numbers and/or categories, not on complex structures such as histograms. Since algorithms that can learn from histogram data have not been explored extensively, this thesis explores this domain further. The thesis is limited to classification, specifically to tree-based classifiers: decision trees and random forests. Decision trees are among the simplest and most intuitive algorithms to train. A single decision tree might not be the best algorithm in terms of predictive performance, but performance can be improved considerably by combining many diverse trees into a random forest. This is why both algorithms were considered. The objective of this thesis is to investigate how these algorithms can be adapted to learn better from histogram data. The proposed approach uses multiple bins of a histogram simultaneously to split a node during tree induction. Treating bins simultaneously is expected to capture dependencies among them, which could be useful. The proposed approaches were evaluated experimentally by comparing them with the standard approach of growing a tree, in which a single bin is used to split a node. Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC), along with the average time taken to train a model, were used for comparison. For the experiments, real-world data from a large fleet of heavy-duty trucks were used to build a component-failure prediction model.
These data contain information about the operation of the trucks over the years, with most operational features summarized as histograms. Further experiments were performed on a synthetically generated dataset. The results show that the proposed approach outperforms the standard approach in predictive performance and model compactness but lags behind in training time. This thesis was motivated by a real-life problem encountered in the automotive industry while building a data-driven failure-prediction model for heavy-duty trucks. Therefore, the collection and cleansing of the data, and the challenges encountered in preparing the data for training, are presented in detail. |