Knowledge distillation is a line of research into more efficient Transformers in which small models (students) are trained to reproduce the outputs of large models (teachers). The technique first gained popularity on classification tasks in computer vision, and has since been applied successfully in several other domains, including LLMs. If you start from...
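
As a rough illustration of what "reproducing the teacher's outputs" can mean in practice, the sketch below computes one common form of distillation loss: the KL divergence between the teacher's and the student's temperature-softened output distributions. The text above does not specify a loss, so the temperature, the T² scaling, the function names (`softmax`, `distillation_loss`), and the example logits are all illustrative assumptions rather than a description of any particular method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the T^2 factor follows a common convention for keeping the
    loss scale comparable across temperatures; it is not taken from the text.
    Extremely large logit gaps could underflow a probability to zero, which
    this sketch does not guard against on the student side.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)
```

In a real training loop this term would typically be minimized with respect to the student's parameters, often alongside an ordinary supervised loss on ground-truth labels; a student whose outputs match the teacher's exactly incurs zero loss.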