Towards Benchmarking Feature Type Inference for AutoML Platforms
Author: | Jonathan Lacanlale, Arun Kumar, Kevin Yang, Vraj Shah, Premanand Kumar |
---|---|
Year of publication: | 2021 |
Subject: |
Feature engineering; Computer science; Model selection; Type inference; Engineering and technology; Machine learning; Workflow; Feature (computer vision); Information systems; Electrical engineering, electronic engineering, information engineering; Benchmark (computing); Artificial intelligence & image processing; Artificial intelligence; Categorical variable; Semantic gap |
Source: | SIGMOD Conference |
DOI: | 10.1145/3448016.3457274 |
Description: | The paradigm of AutoML has created an opportunity to enable ML for the masses. Emerging industrial-scale cloud AutoML platforms aim to automate the end-to-end ML workflow. While many works have looked into automated feature engineering, model selection, or hyper-parameter search in AutoML, little work has studied a crucial step that serves as an entry point to this workflow: ML feature type inference. The semantic gap between attribute types (e.g., strings, numbers) in databases/files and ML feature types (e.g., Numeric, Categorical) necessitates type inference. In this work, we formalize and standardize this task by creating the first labeled benchmark dataset for it, which we use to objectively evaluate existing AutoML tools. Our dataset has 9921 examples and a 9-class label vocabulary. Our labeled data also offers an alternative to existing rule-based or syntax-based approaches for automating this task: use ML itself to predict feature types. We collate a benchmark suite of 30 classification and regression tasks to assess the importance of type inference for downstream models. Empirical comparison on our labeled data shows that an ML-based approach delivers an average lift of 14%, and up to 38%, in accuracy for identifying feature types compared to prominent industrial tools. Our downstream benchmark suite reveals that the ML-based approach outperforms existing industrial-strength tools for 47 out of 60 downstream models. We release our labeled dataset, models, and downstream benchmarks in a public repository with a leaderboard. |
Database: | OpenAIRE |
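The semantic gap mentioned in the description can be illustrated with a minimal Python sketch. This is not the authors' method or any tool's actual logic: `syntactic_type` is a hypothetical stand-in for a syntax-based rule, `column_stats` computes hypothetical descriptive statistics of the kind a learned model might consume, and the sample columns are invented for illustration. Only the two label names, Numeric and Categorical, come from the description.

```python
def syntactic_type(values):
    """Naive syntax-based rule (hypothetical): a column is Numeric
    if every value parses as a number, otherwise Categorical."""
    try:
        for v in values:
            float(v)
    except ValueError:
        return "Categorical"
    return "Numeric"


def column_stats(values):
    """Hypothetical descriptive statistics that an ML-based type
    predictor could use as input features (non-empty column assumed)."""
    n = len(values)
    return {
        "num_values": n,
        "distinct_ratio": len(set(values)) / n,
        "mean_length": sum(len(v) for v in values) / n,
    }


# Invented sample columns, stored as strings as they would be in a CSV file.
zip_codes = ["92093", "92093", "10001", "10001", "92093", "60614"]
prices = ["12.50", "7.25", "103.00", "12.75", "8.10", "55.40"]

# Both columns parse as numbers, so the syntactic rule cannot tell them apart,
# even though ZIP codes behave as categories rather than quantities.
zip_label = syntactic_type(zip_codes)
price_label = syntactic_type(prices)

# The statistics differ: ZIP codes repeat, prices mostly do not.
zip_stats = column_stats(zip_codes)
price_stats = column_stats(prices)
```

In this sketch both columns receive the same syntactic label, while their statistics differ (repeated values for ZIP codes, all-distinct values for prices). A classifier trained on labeled columns could, under these assumptions, use such signals to separate the two cases.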