Tabular benchmark datasets

Young Ben
4 min read · Feb 10, 2023


Tabular datasets

In tabular classification and regression problems, there is no standard benchmark the way there is for image and text data. Since no universally accepted benchmark exists for tabular data, researchers and practitioners have tried to create one by collecting public datasets from previous studies, Kaggle-style contests, and OpenML, each with its own standards and selection criteria. This helps ensure fair performance comparisons and reduces the potential for cherry-picking results.

1. OpenML-CC18

OpenML-CC18, a classification benchmark created by OpenML, is well known in the machine learning community. OpenML provides a platform for sharing datasets, model pipelines, and experimental environments. The benchmark has been widely referenced in other studies and was even used as-is in SCARF, a paper proposing a self-supervised model for tabular data. It consists of 72 datasets, selected with screening criteria such as:

  • Classes: two or more classes, each with at least 20 samples, and a minority-to-majority class ratio of at least 5%
  • Samples: between 500 and 100,000
  • Number of features: 5,000 or fewer (after one-hot encoding)
  • Exclusions: synthetic data, data that is too easy (e.g., fully predicted by a simple decision tree), etc.
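The screening rules above are simple enough to express directly in code. The sketch below is illustrative (the function name and interface are mine, not OpenML's) and checks a labelled dataset against CC18-style thresholds:

```python
from collections import Counter

def passes_cc18_style_screen(y, n_features_one_hot):
    """Check a labelled dataset against CC18-like screening thresholds.

    `y` is the list of class labels; `n_features_one_hot` is the feature
    count after one-hot encoding. Thresholds mirror the criteria listed
    above; the name and signature are illustrative, not OpenML's API.
    """
    counts = Counter(y)
    n = len(y)
    if len(counts) < 2:                                      # at least two classes
        return False
    if min(counts.values()) < 20:                            # every class >= 20 samples
        return False
    if min(counts.values()) / max(counts.values()) < 0.05:   # minority/majority >= 5%
        return False
    if not (500 <= n <= 100_000):                            # sample-size bounds
        return False
    if n_features_one_hot > 5000:                            # dimensionality bound
        return False
    return True

# A balanced 600-sample binary problem passes; a 100-sample one is too small.
print(passes_cc18_style_screen([0] * 300 + [1] * 300, n_features_one_hot=40))  # True
print(passes_cc18_style_screen([0] * 50 + [1] * 50, n_features_one_hot=40))    # False
```

The "too easy" and "synthetic" exclusions are judgment calls and cannot be reduced to a threshold this way.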

2. AutoML benchmark

The AutoML benchmark was developed through a comparison study of various AutoML libraries, including AutoGluon, Auto-sklearn, and FLAML. The benchmark encompasses 71 classification tasks and 33 regression tasks. Examples of the selection criteria are as follows:

  • Difficulty: exclude data that is too easy (e.g., zero test error achieved with a simple decision tree)
  • Real-world: exclude synthetic data far removed from real problems and image data that deep learning solves easily (synthetic datasets that are widely used and pose challenging problems, like kr-vs-kp, are still included; image datasets are included if they address real-world problems and are difficult to solve)
  • Diversity: avoid collections of datasets that pose similar problems, so the benchmark is not biased towards any particular domain (e.g., including only some of jm1, kc1, kc2, pc1, pc2, pc3, which all deal with software quality)
  • i.i.d.: datasets with a temporal nature or repeated measurements were not considered
  • Variety: datasets were compiled with varying numbers of samples, missing values, and categorical or numerical features
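The difficulty criterion above screens out datasets that a simple model solves perfectly. As a minimal, dependency-free sketch, the code below uses a one-split decision stump (a simplified stand-in for the simple decision trees the benchmark actually uses) to flag binary-classification data that is trivially separable; all names are my own:

```python
def stump_accuracy(x, y):
    """Best accuracy of a one-split decision stump on feature `x`,
    with binary 0/1 labels `y`. A simplified stand-in for the
    'simple decision tree' used in the benchmark's difficulty screen."""
    # Majority-class baseline (the stump with no useful split).
    best = max(sum(y), len(y) - sum(y)) / len(y)
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        # Each side predicts its own majority class.
        correct = max(sum(left), len(left) - sum(left)) + \
                  max(sum(right), len(right) - sum(right))
        best = max(best, correct / len(y))
    return best

def too_easy(x, y, threshold=1.0):
    """Flag a dataset whose stump accuracy reaches the threshold (error ~0)."""
    return stump_accuracy(x, y) >= threshold

# Perfectly separable data is flagged as too easy; noisy data is not.
print(too_easy([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # True
print(too_easy([1, 2, 3, 10, 11, 12], [0, 1, 0, 1, 0, 1]))  # False
```

In practice the screen would be run on held-out test data with a full decision tree, but the principle is the same.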

3. Tabular Classification Benchmark

The authors of the paper "Why do tree-based models still outperform deep learning on tabular data?" created benchmarks for evaluating the performance of different models on tabular data. The benchmarks comprise 45 datasets covering both classification and regression tasks. The selection criteria are as follows:

  • Heterogeneity: datasets with homogeneous features (e.g., images, signals) are excluded
  • Not high-dimensional: datasets with a dimension-to-sample ratio of 0.1 or higher are excluded
  • Documentation: data with too little information, such as undisclosed variable names, is excluded (unless the columns could still be determined to be heterogeneous)
  • Cardinality: categorical variables with a cardinality of 20 or more were dropped, and numerical variables with a cardinality below 10 were excluded; numerical variables with a cardinality of 2 were converted to categorical
  • Other criteria, such as i.i.d. data, real-world data, and excluding data that is too easy, also apply
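The cardinality rules are easy to get wrong (binary numerical variables are converted, not dropped, even though 2 < 10), so here is a small sketch of the decision logic. The dict-based interface and column names are illustrative, not taken from the paper:

```python
def classify_columns(columns):
    """Apply the cardinality rules above to a {name: (kind, n_unique)} map.

    `kind` is 'categorical' or 'numerical'. Returns a decision per column.
    The interface is illustrative; the paper works on real dataframes.
    """
    decisions = {}
    for name, (kind, card) in columns.items():
        if kind == "categorical" and card >= 20:
            decisions[name] = "drop"            # high-cardinality categorical
        elif kind == "numerical" and card == 2:
            decisions[name] = "to_categorical"  # binary numeric -> categorical
        elif kind == "numerical" and card < 10:
            decisions[name] = "drop"            # low-cardinality numeric
        else:
            decisions[name] = "keep"
    return decisions

cols = {
    "zip_code": ("categorical", 900),  # dropped: cardinality >= 20
    "is_member": ("numerical", 2),     # converted to categorical
    "rooms": ("numerical", 7),         # dropped: cardinality < 10
    "income": ("numerical", 4200),     # kept
}
print(classify_columns(cols))
```

Note that the binary-numeric check must come before the low-cardinality drop, otherwise the conversion rule would never fire.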

Conclusion

If you examine the studies above, you will notice that the benchmarks share common ground:

  • Configuring the datasets fairly for performance comparison (e.g., including a variety of sample sizes and levels of missing values to assess the robustness of the models)

There are differences as well:

  • Each study defines variable types slightly differently, dividing them into categories such as categorical, numerical, or categorical and continuous.
  • The handling of missing values also varies from study to study: some studies replace missing numerical values with the mean, others remove rows with missing values after dropping columns with many missing values, and still others remove every row containing a missing value.
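The three missing-value strategies just described can be contrasted on a toy table. This is a hedged sketch (the column names and the 40% column-missingness cutoff are my own choices, not from any of the papers) using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25.0, None, 40.0, 31.0],
    "income": [50.0, 60.0, None, None],
    "label":  [0, 1, 0, 1],
})

# Strategy 1: replace missing numerical values with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Strategy 2: drop columns with many missing values (here >40%),
# then drop the remaining rows that still contain a missing value.
sparse_cols = df.columns[df.isna().mean() > 0.4]
pruned = df.drop(columns=sparse_cols).dropna()

# Strategy 3: drop every row containing any missing value.
complete_only = df.dropna()

print(len(imputed), len(pruned), len(complete_only))  # 4 3 1
```

The same raw table yields 4, 3, or 1 usable rows depending on the strategy, which is exactly why these preprocessing differences make cross-study comparisons hard.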

Creating a widely accepted benchmark for tabular classification and regression appears to be a difficult task, owing to the varying interpretations and definitions of variables in the data, as well as differences in sample sizes and missing-value handling. As a result, building standard benchmarks that are fair and unbiased remains challenging.

Also, while collecting tabular data from various sources, I realized that publicly available datasets are scarce and often contain minor errors. It is therefore crucial to manually verify and classify the variables, for example determining whether they are categorical or continuous. This is why, as the papers above suggest, gathering data according to explicit criteria and publishing it on platforms like GitHub or OpenML could be highly beneficial.

References

Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., … & Vanschoren, J. (2017). OpenML benchmarking suites. arXiv preprint arXiv:1708.03731.

Bahri, D., Jiang, H., Tay, Y., & Metzler, D. (2021). Scarf: Self-supervised contrastive learning using random feature corruption. arXiv preprint arXiv:2106.15147.

Gijsbers, P., Bueno, M. L., Coors, S., LeDell, E., Poirier, S., Thomas, J., … & Vanschoren, J. (2022). AMLB: an AutoML benchmark. arXiv preprint arXiv:2207.12560.

Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data?. arXiv preprint arXiv:2207.08815.
