Bottom-Up Standardization For Data Preparation

Author: Lai, Eugenie Y.
Year of publication: 2024
Document type: Master's thesis
Description: Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not the novel or interesting piece of a project; it transforms raw data into a format that enables further innovative work. Because data preparation scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and check. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices. Ideally, data preparation scripts are “admirably boring”: they should serve the project, but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized, more boring version of itself. Our framework takes the user’s input script not as an unchangeable definition of correctness, but as a semantic sketch of the user’s overall intent. We present an algorithmic framework and implement a prototype system. We evaluate our approach against state-of-the-art methods, including GPT-4, on six real-world datasets. Our approach improves script standardization by 39.5% without meaningfully changing the user’s intent, whereas GPT-4 achieves 2.9%.
S.M.
Database: Networked Digital Library of Theses & Dissertations