Knowledge distillation is a machine learning technique used to transfer knowledge from a large, complex model (called the teacher) to a smaller, simpler model (called the student). Instead of training the student only on the original dataset’s “hard” labels, the student learns from the soft outputs (probability distributions) produced by the teacher.
These soft outputs carry more nuanced information about the teacher’s understanding of the data, such as how confident it is in each prediction and how it perceives relationships between classes. Learning from them lets the student model approach the teacher’s accuracy while being far more efficient in terms of speed and resource usage.
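To make this concrete, here is a minimal sketch (PyTorch, with made-up teacher logits for a three-class problem) contrasting a hard label with the teacher’s temperature-softened soft targets; the temperature value of 4 is purely illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one sample of a 3-class problem.
teacher_logits = torch.tensor([4.0, 2.5, 0.5])

hard_label = torch.argmax(teacher_logits)          # class index only
soft_t1 = F.softmax(teacher_logits, dim=-1)        # standard probabilities
soft_t4 = F.softmax(teacher_logits / 4.0, dim=-1)  # temperature T=4 softens the distribution

print(hard_label)  # 0 -- says nothing about classes 1 and 2
print(soft_t1)     # ~[0.80, 0.18, 0.02]
print(soft_t4)     # ~[0.48, 0.33, 0.20] -- relative similarity between classes becomes visible
```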
Knowledge distillation is also known as “neural network distillation”.
Knowledge distillation is often used for:
For example, a deep neural network trained for image classification that achieves 95% accuracy is too heavy to run on a smartphone. Instead, a smaller neural network can be trained using the larger model’s output probabilities as training targets. This means the smaller one “mimics” the larger model’s behavior, often achieving similar performance at a fraction of the computational cost.
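A sketch of how such a training step might look in PyTorch is below; the teacher and student models, the data, and the temperature and weighting values (T=4, alpha=0.5) are placeholder assumptions rather than recommendations.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels, optimizer, T=4.0, alpha=0.5):
    """One optimization step: the student fits both the ground-truth labels
    and the teacher's temperature-softened output probabilities."""
    with torch.no_grad():                      # teacher is already trained and frozen
        teacher_logits = teacher(images)

    student_logits = student(images)

    # Hard-label loss: ordinary cross-entropy against the ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label loss: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```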
Here’s a step-by-step explanation of how knowledge distillation works:
There are three main types of knowledge distillation: response-based, feature-based, and relation-based.
In response-based knowledge distillation, the student mimics the teacher’s predictions from the final output layer. Training minimizes a distillation loss between the teacher’s soft targets and the student’s outputs; as this loss decreases, the student’s predictions converge toward the teacher’s.
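A common way to write this objective, following the soft-target loss of Hinton et al. (2015), combines a cross-entropy term on the ground-truth labels with a temperature-scaled KL-divergence term on the softened outputs (the weighting α and temperature T are hyperparameters):

$$
\mathcal{L}_{\text{KD}} \;=\; \alpha\,\mathrm{CE}\big(y,\ \sigma(z_s)\big) \;+\; (1-\alpha)\,T^{2}\,\mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big)
$$

where $z_s$ and $z_t$ are the student’s and teacher’s logits and $\sigma$ is the softmax function.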
In feature-based knowledge distillation, the student mimics the feature activations from the teacher’s intermediate and output layers. The loss function minimizes the difference between the teacher’s and the student’s feature activations.
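The sketch below shows one way this could look in PyTorch for a single pair of layers; the tensor shapes, channel counts, and the 1×1 projection are illustrative assumptions (in practice the activations would be captured with forward hooks on chosen teacher and student layers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical intermediate activations for one batch; in practice these would
# come from forward hooks on a chosen teacher layer and student layer.
teacher_feat = torch.randn(8, 256, 14, 14)   # teacher layer: 256 channels
student_feat = torch.randn(8, 128, 14, 14)   # student layer: 128 channels

# A learned 1x1 projection maps the student's channels to the teacher's width
# so the two activation maps can be compared directly.
project = nn.Conv2d(128, 256, kernel_size=1)

# MSE between projected student features and (detached) teacher features.
feature_loss = F.mse_loss(project(student_feat), teacher_feat.detach())
```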
The third type, relation-based knowledge distillation, focuses on the relationships between data samples and between layers rather than on individual outputs. It matches structures such as feature maps, graphs, similarity matrices, feature embeddings, or probability distributions derived from feature representations. Depending on the training scheme, the student distills this knowledge after the teacher has been trained (offline distillation), while the teacher is being trained (online distillation), or from the same network acting as its own teacher (self-distillation).
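As one example of the relation-based idea, the sketch below matches pairwise cosine-similarity matrices computed over a batch of teacher and student embeddings; the batch size and embedding dimensions are made up, and this is only one of several possible relational losses.

```python
import torch
import torch.nn.functional as F

# Hypothetical penultimate-layer embeddings for a batch of 8 samples; the
# embedding sizes need not match between teacher and student.
teacher_emb = torch.randn(8, 512)
student_emb = torch.randn(8, 256)

def pairwise_similarity(emb):
    """Cosine-similarity matrix between all samples in the batch."""
    emb = F.normalize(emb, dim=1)
    return emb @ emb.t()                     # shape: (batch, batch)

# The student is trained so that relations *between samples* match the
# teacher's relations, even though the embedding dimensions differ.
relation_loss = F.mse_loss(
    pairwise_similarity(student_emb),
    pairwise_similarity(teacher_emb).detach(),
)
```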
Knowledge distillation offers several key advantages for real-world applications. Here are the main benefits:
Knowledge distillation is a powerful technique, but it has several limitations that are important to consider:
Knowledge distillation is valuable when there’s a need to deploy high-performing models in resource-constrained environments. Here are some of its most impactful applications: