Guided Labeling Episode 3: Model Uncertainty


Click to learn more about author Paolo Tamagnini.

In this series, we’ve been exploring the topic of guided
labeling by looking at active learning and label density. In the first episode,
we introduced active learning and active learning sampling, and in the second
we moved on to look at label density.

In this third episode, we are moving on to look at model uncertainty.

Using label density, we explore the
feature space and retrain the model each time with new labels that are both representative
of a good subset of unlabeled data and different from already labeled data of
past iterations. However, besides selecting data points based on the overall
distribution, we should also prioritize missing labels based on the attached
model predictions. In every iteration, we can score the data that still needs
to be labeled with the retrained model. What can we infer, given those
predictions by the constantly retrained model?

Before we can answer this question, there is another common concept in machine learning classification related to the feature space: the decision boundary. The decision boundary defines a hyper-surface in the feature space of n dimensions, which separates data points depending on the predicted label.
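As a worked special case, consider a linear model such as a linear SVM. The symbols w (weight vector) and b (offset) are introduced here for illustration and do not appear in the original text. Under that assumption, the decision boundary and the predicted label can be written as:

```latex
\[
  \mathcal{B} = \{\, x \in \mathbb{R}^n : w^\top x + b = 0 \,\},
  \qquad
  \hat{y}(x) = \operatorname{sign}\!\left(w^\top x + b\right)
\]
```

For n = 2 this set is a straight line in the plane. Nonlinear models produce curved or closed boundaries instead.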

In Figure 1 below, we return to our data set with only two columns: weight and height. In this case, the decision boundary is a line drawn by a machine learning model to predict overweight and underweight conditions. We use a line in this example, but the boundary could also be a curve or a closed shape.

Figure 1: In the 2D feature space of weight vs. height, we train a machine learning model to distinguish overweight and underweight subjects. The model prediction is visually and conceptually represented by the decision boundary: a line dividing the subjects into the two categories.

So let’s say we are training an SVM model, starting with no labels and using active learning. That means we are trying to find the right line. In the beginning, we label a few subjects selected using label density. Subjects are labeled by simply applying a heuristic called body mass index, so there is no need for a domain expert in this simple example.
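A minimal sketch of such a heuristic labeler is shown below. Body mass index is weight in kilograms divided by the square of height in meters. The units, the function names, and the cut-off of 25.0 are assumptions made for illustration, since the text does not state which threshold separates the two classes.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight in kilograms divided by the square of height in meters."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def heuristic_label(weight_kg: float, height_m: float, threshold: float = 25.0) -> str:
    """Hypothetical two-class rule; the cut-off is an assumption, not from the article."""
    bmi = body_mass_index(weight_kg, height_m)
    return "overweight" if bmi >= threshold else "underweight"


print(heuristic_label(90.0, 1.75))  # BMI is roughly 29.4
print(heuristic_label(55.0, 1.80))  # BMI is roughly 17.0
```

In an active learning loop, a function like this would stand in for the human expert and answer each labeling request.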

In the beginning, the position of the decision boundary will
probably be wrong, as it is based on only a few data points in the densest
areas. However, the more labels you add, the closer the line will
move to the actual separation between the two classes. Our
focus here is to move this decision boundary to the right position using as few
labels as possible. In active learning, this means taking up as little time as
possible of our expensive human-in-the-loop expert.

To use fewer labels, we need data points positioned around the decision boundary, as these are the data points that best define the line we are looking for. But how do we find them without knowing where this decision boundary lies? The answer is that we use model predictions and, to be more precise, model certainty.

Figure 2: In the 2D feature space, the dotted decision boundary belongs to the model trained in the current iteration k. To move the decision boundary in the right direction, we use uncertainty sampling, asking the user to label new data points near the current decision boundary. This lets us identify misclassifications, which lead to a better decision boundary in the next iteration, after the model is retrained.

Looking for Misclassification Using Uncertainty

At each iteration, the decision boundary moves when a newly labeled
point contradicts the model prediction. The intuition behind using model
certainty is that a misclassification is more likely when the model
is uncertain of its prediction. Once the model has achieved decent
performance, high uncertainty signals that a wrong prediction is more
probable. In the feature space, model uncertainty
increases as you get closer to the decision boundary. To quickly move our
decision boundary to the right position, we therefore look for
misclassifications using uncertainty. In this manner, we select data points that
are close to the actual decision boundary (Figure 2).
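The link between distance and uncertainty can be illustrated with a small sketch. It assumes a linear boundary and a model whose class probability is a sigmoid of the signed distance to that boundary. Both the sigmoid mapping and the example weights are assumptions for illustration, not the author's model. A probability near 0.5 means the model is close to undecided.

```python
import math


def signed_distance(point, weights, bias):
    """Signed distance from a point to the line weights . point + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


def positive_class_probability(distance, scale=1.0):
    """Assumed sigmoid mapping from signed distance to a class probability."""
    return 1.0 / (1.0 + math.exp(-scale * distance))


# Hypothetical boundary in a standardized 2D feature space.
weights, bias = (1.0, -1.0), 0.0

# Points are listed from far away to exactly on the boundary.
for point in [(3.0, 0.0), (1.0, 0.5), (0.1, 0.0), (0.0, 0.0)]:
    d = signed_distance(point, weights, bias)
    p = positive_class_probability(d)
    print(f"distance={d:+.2f}  P(positive)={p:.3f}")
```

Under these assumptions the printed probability moves toward 0.5 as the point approaches the line, which is where labeling requests are most informative.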

So here we go. At each iteration, we score all unlabeled data
points with the retrained model. Next, we compute the model uncertainty, take
the most uncertain predictions, and ask the user to label them. By retraining
the model with all of the corrected predictions, we are likely to move the
decision boundary in the right direction and achieve better performance with fewer
labels.
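The selection step of one iteration might be sketched as follows. The function names and the example predictions are hypothetical. The default score is least confidence (1 minus the highest class probability), used here only as a simple stand-in; any other uncertainty score can be passed in through the uncertainty parameter.

```python
def least_confidence(probabilities):
    """Stand-in uncertainty score: 1 minus the highest class probability."""
    return 1.0 - max(probabilities)


def select_for_labeling(predictions, k, uncertainty=least_confidence):
    """Return the ids of the k unlabeled rows with the most uncertain predictions.

    predictions maps a row id to the class-probability vector that the
    retrained model produced for that row.
    """
    ranked = sorted(
        predictions,
        key=lambda row_id: uncertainty(predictions[row_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical scores for four unlabeled rows (two classes).
predictions = {
    "row_a": [0.95, 0.05],
    "row_b": [0.55, 0.45],
    "row_c": [0.70, 0.30],
    "row_d": [0.48, 0.52],
}
print(select_for_labeling(predictions, k=2))  # ['row_d', 'row_b']
```

The returned rows are the ones shown to the user. After they are labeled, the model is retrained and the remaining rows are scored again.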

How Do We Measure Model Certainty/Uncertainty?

There are different metrics; we are going to use the entropy score (Formula 1 below), a common concept in information theory. High entropy indicates high uncertainty. This strategy is also known as uncertainty sampling, and you can find the details in the blog post titled Labeling with Active Learning, which was first published on Data Science Central.
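A standard form of the entropy score that fits the Formula 1 caption (n target classes, score between 0 and 1) uses a logarithm of base n. This is an assumed reconstruction and not necessarily the exact notation of the original formula:

```latex
\[
  E(x) = -\sum_{i=1}^{n} P(l_i \mid x)\, \log_n P(l_i \mid x),
  \qquad 0 \le E(x) \le 1
\]
```

The score is 0 when one class has probability 1 and reaches 1 when all n classes are equally likely, with the convention that a term with probability 0 contributes 0.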

Formula 1: Prediction Entropy Formula. Given a prediction for row x by the classification model, we can retrieve a probability vector P(l|x), which sums to 1 and contains the n probabilities of the row belonging to each possible target class l_i. From this prediction vector, we can compute an entropy score between 0 and 1 that quantifies the uncertainty of the model in its prediction for x.
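A minimal sketch of how such a score could be computed for one prediction vector follows. It assumes the score is the Shannon entropy divided by log(n), which keeps the result between 0 and 1. The function name is hypothetical.

```python
import math


def prediction_entropy(probabilities):
    """Entropy of a class-probability vector, normalized to the range [0, 1]."""
    n = len(probabilities)
    if n < 2:
        raise ValueError("need at least two classes")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: the limit of p * log(p) as p -> 0 is 0.
    entropy = sum(-p * math.log(p) for p in probabilities if p > 0)
    # Dividing by log(n) is the same as using a logarithm of base n.
    return entropy / math.log(n)


print(prediction_entropy([0.5, 0.5]))            # maximally uncertain
print(prediction_entropy([1.0, 0.0]))            # fully certain
print(round(prediction_entropy([0.9, 0.1]), 3))  # fairly certain
```

A row with a score close to 1 is a strong candidate for labeling, while a row with a score close to 0 can safely wait.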

Wrapping Up

In today’s episode, we’ve taken a look at how model uncertainty
can be used as a rapid way of moving our decision boundary to the correct
position, using as few labels as possible, i.e., taking up as little time as
possible with our expensive human-in-the-loop expert. 

In the next blog in this series, we will go on to use
uncertainty sampling to exploit the key areas of the feature space to ensure an
improvement of the decision boundary. Stay tuned.
