Active Learning and Machine Learning
Deep Learning is the de facto algorithmic backbone for all Computer Vision tasks in the modern world, from image classification and segmentation to scene reconstruction and image synthesis. However, the success of most of these algorithms rests on enormous amounts of good-quality labeled data, i.e., examples whose ground truth is already available for training a model (an approach called Supervised Learning).
Data labeling is a time-consuming process that accounts for roughly 80% of the time spent on a Machine Learning project. Even so, some labels may be incorrect, negatively impacting model training. As a result, current methods emphasize reducing the need for labeled training data and making use of the vast amount of unstructured data available in this information technology era.
Active Learning is one such low supervision method, which falls under the category of “Semi-Supervised Learning,” a learning paradigm in which a small amount of labeled data and a large amount of unlabeled data are combined to train a model.
What exactly is Active Learning?
Consider the following before moving on to the definition of active learning:
A deep learning network tasked with classifying dog and cat images does not simply determine whether an image depicts a dog or a cat.
It is far more complicated than that.
In its task, the network makes predictions with a “confidence” score, which tells us how certain the network is of its own prediction (much like how we humans operate).
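As a minimal sketch of what such a confidence score looks like in practice (assuming a two-class network whose raw outputs are logits), confidence is typically read off a softmax over the outputs:

```python
import numpy as np

def softmax(logits):
    """Convert raw network outputs (logits) into a probability distribution."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical raw outputs of a dog-vs-cat classifier for one image.
logits = np.array([2.0, 0.1])  # [dog, cat]
probs = softmax(logits)

confidence = probs.max()                    # how certain the network is
prediction = ["dog", "cat"][probs.argmax()]  # → "dog", with ~0.87 confidence
```

The closer `confidence` gets to 0.5 (for two classes), the less certain the network is about that sample.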
Active Learning is a “human-in-the-loop” Deep Learning framework that starts with a large dataset of which only a small portion (say, 10%) is labeled for model training. Assume you have a dataset of 1,000 samples, 100 of which are labeled. An Active Learning model is first trained on the 100 labeled samples and then makes predictions on the remaining 900 unlabeled samples. Suppose 10 of these 900 predictions come back with very low confidence. The model then requests labels for these ten samples from a human user. That is, an Active Learning framework is interactive, which is where the term “Active” comes from.
Because they are confusing or difficult samples (samples whose embeddings lie close to the classification boundary), the ten samples in the preceding example offer the model the most learning value. Instead of spending hours labeling samples of categories the model could already generalize to without help, the human annotator only has to label the handful of discriminative samples the model requests. This significantly reduces the effort required to label data.
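The query loop described above (pool-based, least-confidence selection) can be sketched end to end. The data, the stand-in “model” (a midpoint threshold between class means), and the sigmoid confidence measure below are all illustrative assumptions, not a real deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "dog vs cat" data: class 0 clusters near -2, class 1 near +2.
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)]).astype(int)

labeled = list(range(0, 1000, 10))                 # pretend 10% is labeled
unlabeled = [i for i in range(1000) if i not in labeled]

def train(X, y, idx):
    """Stand-in 'model': the decision threshold is the midpoint of class means."""
    m0 = X[idx][y[idx] == 0].mean()
    m1 = X[idx][y[idx] == 1].mean()
    return (m0 + m1) / 2

def confidence(x, threshold):
    """Pseudo-probability of class 1 via a sigmoid, folded into a confidence."""
    p1 = 1 / (1 + np.exp(-(x - threshold)))
    return np.maximum(p1, 1 - p1)                  # confidence in predicted class

threshold = train(X, y, labeled)
conf = confidence(X[unlabeled], threshold)

# Query step: ask the human annotator to label the 10 least-confident samples.
query = np.array(unlabeled)[np.argsort(conf)[:10]]
labeled.extend(query.tolist())                     # oracle labels come from y here
```

In a real pipeline, the model would then be retrained on the enlarged labeled set and the loop repeated until the labeling budget runs out.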
Active Learning closely resembles the human learning process. A teacher first trains a student on the fundamentals (the first 10% of labeled data). The student is then assigned tasks to complete under the teacher’s close supervision (model training on the small labeled data). The student is then instructed to solve problems on their own (predictions on the test set) and to seek assistance only if they become truly stuck (very low confidence predictions on 10 samples). Using the knowledge gained from previous mistakes, the student can now perform the same task with much greater accuracy and with much less external assistance.
Next, let’s delve a little deeper into the technical aspects of Active Learning algorithms.
Query Strategies for Active Learning
Active Learning is a method for efficiently utilizing labeling resources. It enhances training efficiency by selecting the most valuable data samples from an unlabeled database for user annotation (query). The samples chosen are those the current model finds most unfamiliar or uncertain. Active learning methods are classified into three types: (1) Stream-based selective sampling, (2) Pool-based sampling, and (3) Query Synthesis methods.
Let’s take a look at each of these methods one by one.
Stream-Based Selective Sampling
Sampling is based on a continuous stream of unlabeled data points. The current model and an informativeness measure (which we’ll look at in more detail in the following section) are used to determine whether or not to ask the user(s) for an annotation for each incoming data point based on a pre-selected threshold value.
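As an illustrative sketch of this per-sample decision (the informativeness measure is reduced to a plain confidence score, and the threshold value is an arbitrary assumption):

```python
def should_query(confidence: float, threshold: float = 0.7) -> bool:
    """Stream-based selective sampling: request an annotation only when the
    current model's confidence in an incoming sample falls below a
    pre-selected threshold. Each decision is made in isolation."""
    return confidence < threshold

# Model confidences for four samples arriving one at a time in a stream.
stream_confidences = [0.95, 0.40, 0.88, 0.65]
queries = [should_query(c) for c in stream_confidences]
# → [False, True, False, True]: only the two uncertain samples are queried
```

Note that no decision here looks at the rest of the stream, which is exactly the limitation discussed next.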
This query type is typically computationally inexpensive, but it provides limited performance benefits due to the isolated nature of each decision: the underlying distribution’s larger context is not considered. As a result, this sampling method is worse at balancing exploration and exploitation of the distribution than other query types.
For example, the PAL (Policy-based Active Learning) framework uses stream-based selective sampling for active learning in a deep reinforcement learning setting. Instead of relying on a fixed heuristic, PAL learns a dynamic data-selection policy from data, formalized as a reinforcement learning problem, which allows the learned strategy to transfer to other data settings, such as cross-lingual applications.