Study suggests that AI model selection might introduce bias


The past several years have made it clear that AI and machine learning are not a panacea when it comes to fair outcomes. Applying algorithmic solutions to social problems can magnify biases against marginalized peoples, and undersampling a population tends to result in worse predictive accuracy for that population. But bias in AI doesn’t arise from the datasets alone. Problem formulation, or the way researchers fit tasks to AI techniques, can contribute. So can other human-led steps throughout the AI deployment pipeline.

To this end, a new study coauthored by researchers at Cornell and Brown University investigates the problems around model selection — the process by which engineers choose machine learning models to deploy after training and validation. They found that model selection presents another opportunity to introduce bias, because the metrics used to distinguish between models are subject to interpretation and judgement.

In machine learning, a model is typically trained on a dataset and evaluated on a metric (e.g., accuracy) using a separate test dataset. To improve performance, the learning process can be repeated. Retraining until one of several candidate models proves satisfactory is what’s known as a “researcher degree of freedom.”

While researchers may report average performance across a small number of models, they often publish results using a specific set of variables that can obscure a model’s true performance. This presents a challenge because other model properties can change during training. Seemingly minute differences in accuracy between groups can add up across large populations, affecting fairness for certain demographics.
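To make the point concrete, here is a minimal sketch of how picking a model by overall accuracy alone can hide a gap between groups, and how a small gap scales with population size. Every number, group name, and function name below is a made-up assumption for illustration. None of it comes from the study.

```python
# Hypothetical results from retraining one model with three random seeds.
# All figures are invented for illustration; they are not from the study.
runs = [
    {"seed": 0, "acc_group_a": 0.95, "acc_group_b": 0.86},
    {"seed": 1, "acc_group_a": 0.90, "acc_group_b": 0.89},
    {"seed": 2, "acc_group_a": 0.92, "acc_group_b": 0.86},
]


def overall_accuracy(run, share_a=0.5):
    """Accuracy on the whole test set, assuming group A makes up `share_a` of it."""
    return share_a * run["acc_group_a"] + (1 - share_a) * run["acc_group_b"]


def group_gap(run):
    """Absolute difference in accuracy between the two groups."""
    return abs(run["acc_group_a"] - run["acc_group_b"])


def extra_errors(run, group_size):
    """Additional misclassifications the worse-served group sees at a given size."""
    return round(group_gap(run) * group_size)


# Selecting on overall accuracy and selecting on the group gap disagree here.
best_by_accuracy = max(runs, key=overall_accuracy)
best_by_gap = min(runs, key=group_gap)

print("highest overall accuracy: seed", best_by_accuracy["seed"])
print("smallest group gap:       seed", best_by_gap["seed"])
print("extra errors per 100,000 people:",
      extra_errors(best_by_accuracy, 100_000), "vs",
      extra_errors(best_by_gap, 100_000))
```

In this toy setup the model with the best headline accuracy is also the one with the widest gap between groups, which a single reported number would not reveal.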

The coauthors highlight a case study in which test subjects were asked to choose a “fair” skin cancer detection model based on metrics they identified. Overwhelmingly, the subjects selected the model with the highest accuracy even though it exhibited the largest disparity between males and females. This is problematic on its face, the researchers say, because the accuracy metric doesn’t provide a breakdown of false negatives (missing a cancer diagnosis) and false positives (mistakenly diagnosing cancer when it’s in fact not present). Including these metrics could have led the subjects to make different choices concerning which model was “best.”
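As a rough illustration of why accuracy alone can mislead, the sketch below breaks predictions down by group into accuracy, false positive rate, and false negative rate. The function name `rates_by_group` and the toy labels are assumptions made for this example, not the study's data or method.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Return accuracy, false positive rate, and false negative rate per group.

    Labels are 1 (cancer present) or 0 (cancer not present).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"  # fn: a missed diagnosis
        else:
            key = "fp" if pred == 1 else "tn"  # fp: a false alarm
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        total = sum(c.values())
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        report[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "false_positive_rate": c["fp"] / negatives if negatives else float("nan"),
            "false_negative_rate": c["fn"] / positives if positives else float("nan"),
        }
    return report


# Toy data, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
groups = ["female"] * 4 + ["male"] * 4

report = rates_by_group(y_true, y_pred, groups)
for group, metrics in report.items():
    print(group, metrics)
```

Both groups score the same accuracy in this toy data, yet one group's errors are all false alarms while the other's are all missed diagnoses, a difference that only shows up once the error types are reported separately.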

“The overarching point is that contextual information is highly important for model selection, particularly with regard to which metrics we choose to inform the selection decision,” the coauthors of the study wrote. “Moreover, sub-population performance variability, where the sub-populations are split on protected attributes, can be a crucial part of that context, which in turn has implications for fairness.”

Beyond model selection and problem formulation, research is beginning to shed light on the various ways humans might contribute to bias in models. For example, researchers at MIT found just over 2,900 errors arising from labeling mistakes in ImageNet, an image database used to train countless computer vision algorithms. A separate Columbia study concluded that biased algorithmic predictions are mostly caused by imbalanced data but that the demographics of engineers also play a role, with models created by less diverse teams generally faring worse.

In future work, the Cornell and Brown University researchers say they intend to see if they can ameliorate the issue of performance variability through “AutoML” methods, which divest the model selection process from human choice. But the research suggests that new approaches might be needed to mitigate every human-originated source of bias.

