How to train LLMs with knowledge distillation

Knowledge distillation is a line of research into more efficient Transformers that trains small models (students) to reproduce the outputs of large models (teachers). The technique initially gained popularity on classification tasks in computer vision, but has since been applied successfully in several domains, including LLMs.

If you start from scratch, you first have to train a large model on generic labelled data. Then you train a small model to mimic the large model using task-specific unlabelled data (and task-specific labelled data, if available). While this process still involves training a large model, that is a one-off cost, albeit a very expensive one. Fortunately, many large models are already trained and available, so you can usually skip this step.
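
To make "mimic the large model" more concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the classic formulation (temperature-scaled softmax plus KL divergence, optionally mixed with cross-entropy when a label is available); the article does not prescribe a specific loss, and the function names, `temperature` and `alpha` are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give softer distributions."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label=None,
                      temperature=2.0, alpha=0.5):
    """Soft-target loss, optionally mixed with cross-entropy on a hard label.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # T^2 keeps the soft-target term on a comparable scale across temperatures.
    soft = temperature ** 2 * kl_divergence(teacher_soft, student_soft)
    if label is None:  # unlabelled example: imitate the teacher only
        return soft
    hard = -math.log(softmax(student_logits)[label])  # standard cross-entropy
    return alpha * hard + (1 - alpha) * soft

# Unlabelled data uses only the teacher signal; labelled data mixes both terms.
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], label=2))
```

In a real training loop the same quantity would be computed on tensors by a deep learning framework and minimized with respect to the student's weights only.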

The more frequent task of making predictions is then handled by the small model, which is significantly more efficient to run than the large model. As a result, knowledge distillation is a particularly popular technique for LLMs in hardware-constrained environments, e.g. on mobile devices.

To give you an idea of the hardware requirements for running a large LLM: serving a single 175-billion-parameter LLM requires at least 350GB of GPU memory (about 2 bytes per parameter at 16-bit precision) on specialized infrastructure. Unsurprisingly, such computational requirements are out of reach for almost all professionals and companies.
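
The 350GB figure is consistent with a simple back-of-the-envelope calculation, sketched below. It assumes 16-bit weights (2 bytes per parameter) and counts only the weights, not activations or caches; the function name is made up for illustration.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

print(weight_memory_gb(175e9))  # 175B parameters in 16-bit -> 350.0
print(weight_memory_gb(7e9))    # a 7B student in 16-bit    -> 14.0
```

The same arithmetic shows why a 7B student is so much easier to serve: its weights fit on a single high-end GPU.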

Fine-tuning with Knowledge Distillation

Fine-tuning is a technique used to improve the performance of a pre-trained model on a specific task. The idea is to take a pre-trained model and train it further on task-specific data, for example for sentiment analysis or question answering.

Fine-tuning with knowledge distillation combines the benefits of both techniques. The idea is to take a large pre-trained model as the teacher and train a smaller model to mimic it using task-specific data. The smaller model is then fine-tuned on the specific task.

Synthetic Data for Fine-tuning

As always, data is crucial for all efforts in the LLM realm. In a nutshell: the more data, the better. But what happens if you don't have the data? Then you must generate realistic synthetic data. But how? Enter fine-tuning with distillation. In this process, you use a large model (e.g., GPT-4V) as the teacher to generate synthetic data for your (much) smaller LLM (e.g., Falcon 7B), the student.

In the example below, I'll show you how to generate a Q&A fine-tuning dataset for a hypothetical stock market advisor.

Step 1: Manually generate a few input examples

First, you have to come up with some real-world examples. They have the following structure:

- user_context: describe the situation and give the LLM an idea who it is dealing with (e.g., "I am a 35-year-old business owner")
- question: describe the intention (e.g., "Is Dogecoin a good investment option?")
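
One way to hold these seed examples in code is a small dataclass, sketched below. The class name `SeedExample` and the variable `seed_examples` are hypothetical, and apart from the first user context, the sample values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedExample:
    user_context: str  # who is asking and what their situation is
    question: str      # what the user wants to know

# Hand-written seeds; only the first user context is taken from the text above.
seed_examples = [
    SeedExample(
        user_context="I am a 35-year-old business owner",
        question="Is now a good time to invest in tech stocks?",
    ),
    SeedExample(
        user_context="I am a 22-year-old student with a small amount of savings",
        question="Should I start with index funds or individual stocks?",
    ),
]

for example in seed_examples:
    print(example.user_context, "|", example.question)
```

A handful of varied seeds is usually enough to show the teacher the pattern; the expansion to a larger set happens in the next step.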

Step 2: Expand the input examples with the teacher LLM

Here comes the first crucial step. Use the teacher LLM (e.g., GPT-4) to generate more input samples. Create a reasonable number, 100 or more to start. Use the manually written input examples for few-shot prompting, which guides the LLM to give you domain-specific samples:

I will give you a sample prompt with a user context section and a question. Can you generate 100 more examples following the same pattern?

# USER CONTEXT 1
...

# QUESTION 1
...

# USER CONTEXT 2
...

# QUESTION 2
...
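
A possible way to assemble this prompt from your seed examples, and to turn the teacher's reply back into structured pairs, is sketched below. The function names are hypothetical, and the parser assumes the teacher sticks to the "# USER CONTEXT n" / "# QUESTION n" layout, which is not guaranteed in practice.

```python
import re

def build_expansion_prompt(seeds, n_new=100):
    """Render (user_context, question) seed pairs into the few-shot prompt."""
    lines = [
        "I will give you a sample prompt with a user context section and a question. "
        f"Can you generate {n_new} more examples following the same pattern?",
        "",
    ]
    for i, (context, question) in enumerate(seeds, start=1):
        lines += [f"# USER CONTEXT {i}", context, "", f"# QUESTION {i}", question, ""]
    return "\n".join(lines).rstrip() + "\n"

# Matches one "# USER CONTEXT n ... # QUESTION n ..." block in the teacher's reply.
_BLOCK = re.compile(
    r"#\s*USER CONTEXT\s*\d+\s*\n(?P<context>.*?)"
    r"\n\s*#\s*QUESTION\s*\d+\s*\n(?P<question>.*?)"
    r"(?=\n\s*#\s*USER CONTEXT|\Z)",
    re.DOTALL,
)

def parse_generated_examples(reply):
    """Extract (user_context, question) pairs, skipping incomplete blocks."""
    pairs = []
    for match in _BLOCK.finditer(reply):
        context = match.group("context").strip()
        question = match.group("question").strip()
        if context and question:
            pairs.append((context, question))
    return pairs

seeds = [("I am a 35-year-old business owner", "Should I diversify my portfolio?")]
print(build_expansion_prompt(seeds, n_new=100))
```

Since generated samples tend to repeat themselves, it is worth deduplicating the parsed pairs before moving on.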

Step 3: Use the teacher LLM to generate outputs

After creating the input data, use the teacher LLM to generate the answers for your inputs. Iterate through your data and use this prompt:

You are an expert on stock markets. I will give you some user context and you will provide me with a good answer to my question.

# USER CONTEXT
{USER_CONTEXT}

# QUESTION
{QUESTION}

Please provide a concrete answer and justify your answer based on the information provided in the user context.
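
The iteration over your input data might look like the sketch below. `ask_teacher` is a hypothetical callable (prompt string in, answer string out) that you would back with whatever API client you use; `fake_teacher` is only a stand-in so the example runs offline.

```python
ANSWER_PROMPT = """You are an expert on stock markets. I will give you some user context and you will provide me with a good answer to my question.

# USER CONTEXT
{USER_CONTEXT}

# QUESTION
{QUESTION}

Please provide a concrete answer and justify your answer based on the information provided in the user context."""

def generate_answers(examples, ask_teacher):
    """Fill the template for every (user_context, question) pair and query the teacher."""
    records = []
    for user_context, question in examples:
        prompt = ANSWER_PROMPT.format(USER_CONTEXT=user_context, QUESTION=question)
        answer = ask_teacher(prompt).strip()
        if not answer:  # skip empty replies rather than train on them
            continue
        records.append(
            {"user_context": user_context, "question": question, "answer": answer}
        )
    return records

def fake_teacher(prompt):
    """Stand-in for a real teacher LLM call (assumption: replace with your client)."""
    return "Placeholder answer."

examples = [("I am a 35-year-old business owner", "Should I diversify my portfolio?")]
print(generate_answers(examples, fake_teacher))
```

Passing the teacher in as a function keeps the loop independent of any particular provider and makes it easy to add retries or rate limiting around the call.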

And we're (almost) done. After running through your input data, you have your training data for smaller LLMs. If you want to make sure that the content is good, you should consider hiring a domain expert to check and refine the data. Here's a concrete example of synthetic data from this article from Pau Labarta's RLML newsletter.
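
To hand the result to a fine-tuning job or to a reviewer, you could store it as JSON Lines, as in the sketch below. The `prompt` and `completion` field names and the file name are assumptions; use whatever layout your fine-tuning framework expects.

```python
import json

def to_training_record(user_context, question, answer):
    """Map one teacher-labelled sample onto a prompt/completion pair."""
    prompt = f"# USER CONTEXT\n{user_context}\n\n# QUESTION\n{question}"
    return {"prompt": prompt, "completion": answer}

def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Load the records back, e.g. after a domain expert has edited the file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

if __name__ == "__main__":
    sample = to_training_record(
        "I am a 35-year-old business owner",
        "Should I diversify my portfolio?",
        "Placeholder answer.",
    )
    write_jsonl([sample], "finetuning_data.jsonl")  # hypothetical file name
    print(read_jsonl("finetuning_data.jsonl"))
```

One record per line makes the file easy to review and correct by hand before training.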

Knowledge distillation is a very useful approach for making LLMs work in niche applications with little available data. However, you should be aware that your student model may only imitate the style of the teacher model without acquiring its reasoning capabilities. To address this limitation, researchers from Microsoft devised a method that uses tailor-made synthetic data to teach the reasoning as well. This strategy holds the promise of bridging the gap between style imitation and genuine reasoning skills in student models.

Conclusion

Equipped with these methods, you are better positioned to use large language models (LLMs) in specialized applications, even when little data is available. By combining knowledge distillation with synthetic data that emphasizes both style and reasoning, you can develop student models that are stylistically aligned with their teacher models and also better at reasoning. This lets you build more effective, tailored AI solutions for specific needs in niche domains.