Types of Fine Tuning

Let us explore various types of fine-tuning methods.

Supervised fine-tuning

  • Supervised fine-tuning takes a pre-trained model and trains it further on a task-specific dataset with labeled examples. This task-specific dataset includes input-output pairs, where the model learns to map inputs to corresponding outputs.
  • Process:
    • Take a pre-trained model.
    • Prepare the dataset as input-output pairs in the format expected by the model.
    • Train the model – The pre-trained weights are adjusted during fine-tuning to adapt the model to the specific task.
  • Use Cases:
    • Supervised fine-tuning is typically used for task-specific objectives such as text classification, sentiment analysis, or named entity recognition. A minimal code sketch of the overall workflow follows this list.
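A minimal sketch of this workflow using the Hugging Face Trainer API is shown below. The model name (distilbert-base-uncased), the IMDB sentiment dataset, and all hyperparameters are illustrative assumptions, not a prescribed recipe.

```python
# Supervised fine-tuning sketch: pre-trained model + labeled input-output pairs.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# 1. Take a pre-trained model (a small encoder with a fresh classification head).
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Prepare the dataset as input-output (text, label) pairs.
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)
dataset = dataset.map(tokenize, batched=True)

# 3. Train: the pre-trained weights are adjusted on the task-specific data.
args = TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=dataset["test"].select(range(500)))
trainer.train()
```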

Instruction fine-tuning

  • Instruction fine-tuning is a type of fine-tuning in which the input-output examples are further augmented with instructions in the prompt template, which enables instruction-tuned models to generalize more easily to new tasks.
  • Process:
    • Take a pre-trained model.
    • Prepare the dataset. For instruction fine-tuning, the data must be in the form of instruction-response pairs (see the template sketch after this list).
    • Train the model on the instruction-response pairs. The training process itself is the same as for any other neural-network fine-tuning.
  • Use cases:
    • Instruction fine-tuning is generally used where we need the model to behave like a chatbot, for example for question answering.
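As a sketch, the snippet below shows one way raw input-output pairs can be wrapped in an instruction prompt template; the exact template wording and field names are assumptions, since many formats are in common use.

```python
# Wrapping plain input-output pairs in an instruction prompt template.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def to_instruction_example(instruction: str, input_text: str, output_text: str) -> str:
    """Build a single instruction-response training example."""
    return PROMPT_TEMPLATE.format(instruction=instruction,
                                  input=input_text, output=output_text)

example = to_instruction_example(
    instruction="Summarize the following dialogue.",
    input_text="#Person1#: The meeting moved to 3 pm. #Person2#: Thanks, I'll update the invite.",
    output_text="Person1 tells Person2 the meeting now starts at 3 pm.",
)
print(example)
```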

PEFT methods

Training a full model is generally challenging. Storing the weights of a 1 B-parameter model in 32-bit precision takes about 4 GB, and during training we need extra memory for gradients, optimizer states, activations, and temporary variables, which typically adds another 12 bytes or more per parameter. Hence a model of roughly 1 billion parameters is about the largest that can be fully trained on 16 GB of GPU memory; models beyond this size need more memory, resulting in high compute cost and other training challenges.
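A back-of-the-envelope estimate of these requirements, assuming 32-bit precision and the Adam optimizer and ignoring activation and temporary memory, might look like this:

```python
# Rough training-memory estimate for full fine-tuning (activations ignored).
def full_training_memory_gb(num_params: float) -> float:
    bytes_per_param = 4 + 4 + 8   # weights + gradients + Adam moment estimates
    return num_params * bytes_per_param / 1e9

print(f"1B params, weights only : {1e9 * 4 / 1e9:.1f} GB")       # ~4 GB
print(f"1B params, full training: {full_training_memory_gb(1e9):.1f} GB + activations")
```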

To train large models efficiently on limited compute resources, we use PEFT (Parameter-Efficient Fine-Tuning) methods. These methods do not update all of the model's weights, which reduces memory requirements significantly. PEFT can further be classified as follows:

1. Selective Method

In the selective method, we freeze most of the model's layers and unfreeze only a few selected layers. We train and modify the weights of just these layers to adapt the model to our specific task (a minimal sketch follows). This method is generally not used in practice.
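A minimal PyTorch sketch of the selective method is shown below; the model name and the choice of layers to unfreeze (the last encoder block and the classification head) are assumptions for illustration.

```python
# Selective method: freeze everything, unfreeze only a few chosen layers.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
for param in model.parameters():               # freeze the whole model
    param.requires_grad = False

for name, param in model.named_parameters():   # unfreeze selected layers only
    if name.startswith("bert.encoder.layer.11") or name.startswith("classifier"):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```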

2. Reparameterization Method

This is the most common method. It reparameterizes the model's weight updates using low-rank matrices, a technique known as LoRA (Low-Rank Adaptation). We keep the original model weights frozen and instead inject small new trainable low-rank matrices.

Example

  • Let's say a weight matrix in the model has dimensions d × k = 512 × 64.
  • If we use standard fine-tuning, we would be updating all 512 × 64 = 32,768 parameters of that matrix.
  • With LoRA, we instead take two matrices of low rank. Let that rank be r = 8. We create matrices A and B such that A is 8 × 64 and B is 512 × 8, so the product B × A has size 512 × 64, the same as the original weight matrix.
  • We train the weights of A and B instead of the model weights, and the product B × A is added to the frozen model weights.
  • The total number of trainable parameters is 8 × 64 + 512 × 8 = 512 + 4,096 = 4,608, which is far fewer than the 32,768 required for full fine-tuning (a numeric sketch follows).
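The same arithmetic can be verified with a small PyTorch sketch. The random values are purely illustrative; in LoRA, B is typically initialized to zero so that training starts from the original weights.

```python
# LoRA sketch: frozen W (512 x 64) plus trainable low-rank factors A and B.
import torch

d, k, r = 512, 64, 8
W = torch.randn(d, k)                      # frozen pre-trained weight matrix
A = torch.randn(r, k, requires_grad=True)  # trainable, 8 x 64
B = torch.zeros(d, r, requires_grad=True)  # trainable, 512 x 8 (zero init)

W_adapted = W + B @ A                      # effective weights in the forward pass
print(W_adapted.shape)                     # torch.Size([512, 64])
print(A.numel() + B.numel(), "trainable vs", W.numel(), "full")   # 4608 vs 32768
```

In practice, libraries such as Hugging Face PEFT wrap this pattern, so you only specify the rank and which weight matrices to adapt.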

QLoRA – It's a further extension of the LoRA method. Here we further reduce memory requirements by quantizing the weights. Normally 32 bits (4 bytes) are used to store each model weight and parameter during training; with quantization we can use 16, 8, or even 4 bits per weight. This results in some loss of precision but considerably reduces memory.
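A sketch of how QLoRA-style loading might look with the Transformers, bitsandbytes, and PEFT libraries: the frozen base weights are quantized to 4 bits and LoRA adapters are added on top. The model name, target modules, and hyperparameters are assumptions.

```python
# QLoRA sketch: 4-bit quantized frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
                                             quantization_config=bnb_config)

lora_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                         lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)  # only the LoRA matrices are trainable
model.print_trainable_parameters()
```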

3. Additive Method

Adapters – In the adapter method, we add new layers either on the encoder or decoder side of the model and train only these new layers for our specific task (a minimal sketch follows).
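A minimal sketch of such an added layer, in the spirit of a bottleneck adapter: a small down-projection and up-projection with a residual connection that would be inserted inside a transformer block. The hidden and bottleneck sizes are assumptions.

```python
# Bottleneck adapter layer: only these small new weights are trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))      # residual connection

adapter = Adapter()
print(adapter(torch.randn(2, 16, 768)).shape)           # (batch, seq, hidden)
```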

Soft prompting – In soft prompting (also called prompt tuning), we add new trainable tokens to the model's prompt. These new token embeddings are trained while all other tokens and the model weights are kept frozen (see the sketch below).
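A conceptual sketch of soft prompting: a small matrix of trainable virtual-token embeddings is prepended to the frozen input embeddings. The sizes are assumptions; libraries such as Hugging Face PEFT offer this through their prompt-tuning configuration.

```python
# Soft prompting sketch: trainable virtual tokens prepended to frozen embeddings.
import torch
import torch.nn as nn

hidden_size, num_virtual_tokens = 768, 20
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size))  # trainable

def prepend_soft_prompt(input_embeds: torch.Tensor) -> torch.Tensor:
    """input_embeds: (batch, seq_len, hidden) from the frozen embedding layer."""
    prompt = soft_prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
    return torch.cat([prompt, input_embeds], dim=1)  # (batch, 20 + seq_len, hidden)

print(prepend_soft_prompt(torch.randn(4, 32, hidden_size)).shape)
```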

RLHF

RLHF stands for Reinforcement Learning from Human Feedback. It is used to align a model so that it generates output that humans prefer.

RLHF is generally applied after fine-tuning: it takes a fine-tuned model and aligns its output with human preferences, using the machinery of reinforcement learning to do so.

RLHF involves the following steps:

  • Prepare the dataset – We prompt the fine-tuned model to generate several different completions for each prompt. These prompt-completion pairs are then ranked by human evaluators against the alignment criteria. This is the most critical and time-consuming step in RLHF.
  • Train the reward model – Using the ranked dataset, we train a reward model that outputs a high score for preferred completions and a low score for rejected ones (a sketch of the typical ranking loss follows this list).
  • Update the model – Once the reward model is ready, we use a reinforcement learning algorithm to further update the weights of the fine-tuned model. The PPO algorithm is generally used, as it has been shown to perform well.
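As a sketch, the reward model in the second step is commonly trained with a pairwise ranking loss that pushes the score of the human-preferred completion above that of the rejected one; the scoring model itself is left abstract here.

```python
# Pairwise ranking loss typically used for reward-model training.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor) -> torch.Tensor:
    """chosen_scores / rejected_scores: (batch,) scalar rewards per completion."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([1.2, 0.7, 2.0])     # toy scores for preferred completions
rejected = torch.tensor([0.3, 0.9, -0.5])  # toy scores for rejected completions
print(reward_ranking_loss(chosen, rejected))
```

The trained reward model then supplies the scalar reward that the PPO step maximizes, usually together with a KL penalty that keeps the updated model close to the original fine-tuned model.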

Fine Tuning Large Language Model (LLM)

Large Language Models (LLMs) have revolutionized natural language processing by excelling at tasks such as text generation, translation, summarization, and question answering. Despite their impressive capabilities, these models may not always be suitable for a specific task or domain out of the box. To overcome this, fine-tuning is performed. Fine-tuning allows users to customize a pre-trained language model for a specialized task by refining the model on a limited dataset of task-specific data, enhancing its performance on that task while retaining its overall language proficiency.

Table of Content

  • What is Fine Tuning?
  • Why Fine-tune?
  • Types of Fine Tuning
  • Prompt Engineering vs RAG vs Fine tuning
  • When to use fine-tuning?
  • How is fine-tuning performed?
  • Fine Tuning Large Language Model Implementation
