Popis: |
Data analysis and visualization techniques (such as split-apply-combine) make extensive use of associative tabular data structures, which are cumbersome to handle with common aggregation APIs (for arrays, lists, or dictionaries). In these cases, a fluent API for querying associative tabular data (like the ones provided by Pandas, Mathematica, or LINQ) is more appropriate for interactive exploration environments. In Smalltalk, although many important analysis tools are already present (e.g., in the PolyMath library), this essential part of the data science toolkit is still missing. Specialized data structures for tabular datasets can provide a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use. In this paper we introduce the DataFrame and DataSeries collections, which are specifically designed for working with structured data. We demonstrate how these tools can be used for descriptive statistics and Exploratory Data Analysis (EDA), the critical first step of data analysis, which allows us to summarize a dataset, detect mistakes, determine the relationships between variables, and select an appropriate model for further confirmatory analysis. We then detail the trade-offs we currently face in our implementation for Pharo and discuss future perspectives. |