
Rodemann, Julian (ORCID: https://orcid.org/0000-0001-6112-4136); Fischer, Sebastian; Schneider, Lennart; Nalenz, Malte and Augustin, Thomas (13 December 2022): Not All Data Are Created Equal: Lessons From Sampling Theory For Adaptive Machine Learning. IMS International Conference on Statistics and Data Science (ICSDS), Florence, December 13, 2022 - December 16, 2022.



In survey methodology, inverse probability weighted (Horvitz-Thompson) estimation has become an indispensable part of statistical inference. Its use is driven by the need to deal with complex samples, that is, non-identically distributed data. The general idea is that weighting observations inversely to their probability of being included in the sample produces unbiased estimators with reduced variance.
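As a toy illustration of this idea (a minimal sketch with simulated data, not the poster's setup), inverse probability weighting corrects the bias of a sample whose inclusion probability depends on the value itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population; the inclusion probability rises with the value itself,
# so the resulting sample is complex (non-i.i.d.) and the naive mean is biased.
population = rng.normal(size=10_000)
incl_prob = np.clip(0.5 / (1.0 + np.exp(-population)), 0.02, 0.98)

included = rng.random(population.size) < incl_prob
sample, pi = population[included], incl_prob[included]

naive_mean = sample.mean()                          # biased towards large values
ipw_mean = np.sum(sample / pi) / np.sum(1.0 / pi)   # Hajek-normalized Horvitz-Thompson mean

print(f"population {population.mean():+.3f}  naive {naive_mean:+.3f}  IPW {ipw_mean:+.3f}")
```

The naive mean overshoots the population mean, while the inverse probability weighted mean lands close to it, at the cost of some extra variance from the unequal weights.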

In this work, we argue that complex samples are subtly ubiquitous in two promising subfields of data science: Self-Training in Semi-Supervised Learning (SSL) and Bayesian Optimization (BO). Both methods rely on refitting learners to artificially enhanced training data. These enhancements are based on pre-defined criteria for selecting data points, rendering some data more likely to be added than others. We experimentally analyze the distance between the resulting complex samples and i.i.d. samples via Kullback-Leibler divergence and maximum mean discrepancy. Moreover, we propose to handle such samples by inverse probability weighting. This requires estimating inclusion probabilities. Unlike with some observational survey data, however, this is not a major issue, since we have abundant explicit information on the inclusion mechanism: after all, we generate the data ourselves by means of the selection criteria.
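A minimal sketch of how such a distance could be computed (using the plain biased empirical MMD estimator with an RBF kernel; the data and the selection rule are made up for illustration):

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Biased empirical estimate of squared maximum mean discrepancy (RBF kernel)."""
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(1)
iid_a = rng.normal(size=500)            # i.i.d. reference sample
iid_b = rng.normal(size=500)            # second i.i.d. sample from the same law
biased = rng.normal(size=2000)
biased = biased[biased > 0.0][:500]     # toy selection-biased "complex" sample

mmd_iid = mmd_rbf(iid_a, iid_b)         # close to zero
mmd_sel = mmd_rbf(iid_a, biased)        # noticeably larger
print(f"MMD^2 iid vs iid: {mmd_iid:.4f}, iid vs biased: {mmd_sel:.4f}")
```

The MMD between two i.i.d. samples from the same distribution is near zero, whereas the selection-biased sample sits at a clearly larger distance from the reference.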

To make things more tangible, consider the case of BO first. It optimizes an unknown function by iteratively approximating it through a surrogate model, whose mean and standard error estimates are scalarized into a selection criterion. The points optimizing this criterion are evaluated and added to the training data. We propose to weight them by means of the surrogate model's standard errors at the time of selection. When random forests are deployed as surrogate models, we refit them by weighted drawing in the bootstrap sampling step. Refitting may be done iteratively, aiming to speed up the optimization, or after convergence, aiming to provide practitioners with a (globally) interpretable surrogate model.
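The weighted drawing in the bootstrap step could be sketched as follows (all data here are simulated, and weighting proportional to the standard error at selection is one reading of the proposed correction, not the poster's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical BO archive: evaluated points, objective values, and the surrogate's
# standard error at the time each point was selected (assumed to be logged).
X = rng.uniform(-2.0, 2.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=40)
se_at_selection = rng.uniform(0.05, 1.0, size=40)

# Weighted drawing in the bootstrap step: inclusion probabilities proportional to
# the standard error at selection, instead of uniform resampling.
p = se_at_selection / se_at_selection.sum()
idx = rng.choice(len(X), size=len(X), replace=True, p=p)
X_boot, y_boot = X[idx], y[idx]  # data for refitting one tree of the forest
```

Repeating this draw once per tree yields a refitted forest in which points evaluated under high surrogate uncertainty are represented more often than under the usual uniform bootstrap.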

Similarly, self-training in SSL selects instances from a set of unlabeled data, predicts their labels, and adds these pseudo-labeled data to the training data. Instances are selected according to a confidence measure, e.g., the predictive variance. Regions of the feature space where the model is very confident are thus over-represented in the selected sample. We again explicitly exploit the selection criteria to define weights, which we use for resampling-based refitting of the model. Somewhat counter-intuitively, the more confident the model is in its self-assigned labels, the lower their weights should be in order to counteract the selection bias. Preliminary results suggest this can increase generalization performance.
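The counter-intuitive weighting can be illustrated with a simulated confidence measure (names and the linear weighting rule are hypothetical; the poster derives weights from the actual selection criterion):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated self-training round: per-instance confidence for 200 unlabeled points;
# only the most confident instances are selected for pseudo-labeling.
confidence = rng.uniform(0.5, 1.0, size=200)
selected = confidence > 0.9

# Counter-intuitive correction: the MORE confident the pseudo-label, the LOWER its
# weight, counteracting over-representation of high-confidence feature-space regions.
weights = 1.0 - confidence[selected]
weights /= weights.sum()

# Weighted resampling of the pseudo-labeled set for refitting the model.
n_sel = int(selected.sum())
idx = rng.choice(n_sel, size=n_sel, replace=True, p=weights)
```

The least confident selected instance receives the largest weight, so the refitted model leans less on the regions where the selection step was most aggressive.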
