Feature engineering, dimensionality reduction and interpretability through autoencoders for structured data

Author: Bofarull Cabello, Antoni
Contributors: Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Arias Vicente, Marta, Arratia Quesada, Argimiro Alejandro
Language: English
Year of publication: 2021
Subject:
Source: UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Description: Machine Learning is the area of Artificial Intelligence in which algorithms learn from data, so a good selection of features is essential for models to perform their tasks well. We employ a denoising autoencoder architecture and extend it with several dilated convolutions so that it can aggregate features from different contexts. We apply sparse group Lasso regularization to group these features and automatically identify the most relevant ones, and we also apply it to the bottleneck neurons to determine whether the dimensionality can be reduced further. Besides reconstruction, we include an extra output from the bottleneck that performs classification; this multi-task learning leverages context-specific information that improves the quality of the encoding. Deep Learning models have commonly been considered black boxes. However, because they perform significantly better than interpretable models such as linear regression, this has not been a problem in contexts where understanding the model matters less than obtaining good results. In this project, we study the interpretability of models using the Shapley value method and its extensions. In the practical part, we study the proposed model empirically. The results show that the network architecture can identify the most relevant dilation. On the one hand, we can interpret the model globally by inspecting its weights, as in linear regression; the advantage over other models is that the weights are grouped by the kernels of the dilated convolutions. On the other hand, through the input-output importance matrix computed with Shapley values, we can identify which parts of an instance are most relevant to reconstructing its output.
Database: OpenAIRE
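
Illustration: the description mentions sparse group Lasso regularization over groups of weights. The record does not give the thesis's exact formulation, so the following is only a minimal sketch of the commonly used form of the penalty, lam_group * sum over groups of sqrt(group size) * L2 norm, plus lam_l1 * L1 norm of all weights. The function name, the parameter names and the example weights (one group per dilation rate) are assumptions made for this illustration and are not taken from the work itself.

```python
import math


def sparse_group_lasso_penalty(groups, lam_group=1.0, lam_l1=1.0):
    """Sparse group Lasso penalty for a list of weight groups.

    groups    : list of lists of floats, one inner list per group
                (hypothetically, one group per dilated-convolution kernel).
    lam_group : strength of the group term (drives whole groups to zero).
    lam_l1    : strength of the L1 term (drives single weights to zero).
    """
    group_term = 0.0
    l1_term = 0.0
    for weights in groups:
        # Group term: L2 norm of the group, scaled by sqrt of its size.
        l2_norm = math.sqrt(sum(w * w for w in weights))
        group_term += math.sqrt(len(weights)) * l2_norm
        # L1 term: sum of absolute values of the individual weights.
        l1_term += sum(abs(w) for w in weights)
    return lam_group * group_term + lam_l1 * l1_term


# Hypothetical example: two kernels, the second one fully pruned.
kernel_weights = [
    [3.0, 4.0],  # assumed weights for dilation rate 1
    [0.0, 0.0],  # assumed weights for dilation rate 2 (contributes nothing)
]
penalty = sparse_group_lasso_penalty(kernel_weights, lam_group=0.5, lam_l1=0.1)
print(round(penalty, 4))
```

A group whose weights are all zero adds nothing to either term, which is why this kind of penalty can be read as indicating which groups (here, which dilations) the model keeps.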