A comparison of machine learning methods for extremely unbalanced industrial quality data

Authors:	André Luiz Pilastri, Paulo Cortez, Pedro José Pereira, Adriana Pereira
Contributors:	Universidade do Minho
Language:	English
Year of publication:	2021
Subject:	Computer science; Big data; Automotive industry; Engineering and technology; Industry, innovation and infrastructure; Machine learning; Information systems; Electrical engineering, electronic engineering, information engineering; Quality (business); Random forest; Science & Technology; Natural Sciences::Computer and Information Sciences; Autoencoder; Undersampling; Data quality; Anomaly detection; Industrial data; Artificial intelligence & image processing; Artificial intelligence
Source:	Progress in Artificial Intelligence ISBN: 9783030862299 EPIA
Description:	The Industry 4.0 revolution is impacting manufacturing companies, which need to adopt more data intelligence processes in order to compete in the markets in which they operate. In particular, quality control is a key manufacturing process that has been addressed by Machine Learning (ML), aiming to improve productivity (e.g., reduce costs). However, modern industries produce only a tiny portion of defective products, which results in extremely unbalanced datasets. In this paper, we analyze recent big data collected from a major automotive assembly manufacturer and related to the quality of eight products. The eight datasets include millions of records but only a tiny percentage of failures (less than 0.07%). To handle such datasets, we perform a two-stage ML comparison study. First, we consider two products and explore four ML algorithms, namely Random Forest (RF), two Automated ML (AutoML) methods and a deep Autoencoder (AE), and three strategies for balancing the training data, namely None, Synthetic Minority Oversampling Technique (SMOTE) and Gaussian Copula (GC). When considering both classification performance and computational effort, interesting results were obtained by RF. Then, the selected RF was further explored by considering all eight datasets and five balancing methods: None, SMOTE, GC, Random Undersampling (RU) and Tomek Links (TL). Overall, competitive results were achieved by the combination of GC with RF. This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 39479; Funding Reference: POCI-01-0247-FEDER-39479].
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::25f5fc7f2bd5af1bc6fff7bd9883aec9 https://hdl.handle.net/1822/73976
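
Note: two of the balancing strategies named in the description, Random Undersampling (RU) and SMOTE, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `random_undersample` and `smote_like_oversample`, the parameters `ratio`, `n_new` and `k`, and the toy data are all assumptions made for illustration, and the oversampler is a simplified SMOTE-style interpolation rather than the full published algorithm.

```python
import random


def random_undersample(X, y, minority_label=1, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of the majority rows.

    `ratio` (an assumed parameter) is the target majority/minority count.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_like_oversample(X, y, minority_label=1, n_new=10, k=3, seed=0):
    """Append n_new synthetic minority rows, each interpolated between a
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [minority_label] * n_new


# Toy data (hypothetical): 995 normal records and 5 failures, 2 features.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
y = [0] * 995 + [1] * 5

X_ru, y_ru = random_undersample(X, y)              # shrink the majority class
X_sm, y_sm = smote_like_oversample(X, y, n_new=990)  # grow the minority class
print(len(y_ru), sum(y_ru))
print(len(y_sm), sum(y_sm))
```

Either resampled training set could then be passed to a classifier such as a random forest. Only the training split should be resampled, so that evaluation still reflects the original class imbalance.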