Sheet 4.1 Supervised fine-tuning and RL fine-tuning#
Author: Polina Tsvilodub
This sheet provides an overview of different flavours of fine-tuning of LLMs and their respective use cases. Particular focus is placed on RL-based fine-tuning and, specifically, RLHF (reinforcement learning from human feedback).
The key learning goals for this sheet are:
be able to identify the type of fine-tuning used for a particular model and explain it conceptually
gain a practical understanding of RLHF components, in particular:
creation and use of a reward model
training steps for the policy.
Flavours of fine-tuning#
In sheet 2.5, we already brushed over the distinction between pretrained and fine-tuned models. In particular, fine-tuning was introduced as training LMs on task-specific datasets, e.g., for movie review generation or classification. For classification, LMs are usually fine-tuned with a classification head (see sheet 3.2 for a recap). However, from here on, we will focus on fine-tuning of models for generative tasks, i.e., simply fine-tuning LMs for next-word prediction on a specific task.
We have seen fine-tuning of a generative LM for question answering in homework 1. Here, the specific task the model is supposed to learn is constituted by the specific dataset and the formatting of the questions and the answers. The model was trained on examples of questions with correct answers; i.e., this is supervised fine-tuning where the model was shown what the desired behavior is (i.e., which next tokens it is supposed to predict).
For state-of-the-art LLMs, it has been identified that there is a particular kind of task that general-purpose LLMs, and in particular assistants, should be able to complete: namely, instruction-following. That is, rather than creating LMs that can only do QA on a specific dataset with a particular formatting, the community started to build instruction-tuned LLMs, which generate sensible outputs given user instructions ranging from “Provide ten recipes with tofu in a bullet list format” to “Summarize the following scientific paper”. This has also been achieved with supervised fine-tuning, where the input-output pairs in the dataset consist of example instructions and their respective completions.
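To make this concrete, below is a small, purely illustrative sketch of what a single instruction-tuning example might look like (the field names and contents are made up; real datasets use varying schemas and chat templates):

# a hypothetical instruction-tuning example (field names and contents are illustrative)
example = {
    "instruction": "Provide three recipes with tofu in a bullet list format.",
    "completion": "- Tofu stir-fry with broccoli: ...\n- Crispy baked tofu: ...\n- Miso soup with silken tofu: ...",
}
# for supervised instruction-tuning, the model is trained with the usual
# next-token prediction objective on the concatenation of instruction and completion
training_text = example["instruction"] + "\n" + example["completion"]
print(training_text)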
Exercise 4.1.1: Supervised fine-tuning
Take a look at this dataset. For which kind of fine-tuning is it intended? What kinds of examples are there?
Consider the following use cases for which you want to build an LM: (a) an assistant tutor model for students for different subjects, (b) a model for answering highly specific questions about a medical knowledge base, (c) a model intended for writing abstracts of scientific papers. What kind of fine-tuning setup would you consider (i.e., what kind of fine-tuning dataset would you ideally choose)?
Please come up with a prompt for testing whether an LM can follow instructions. Use the following code to test the instruction-following performance of an instruction-tuned model (Phi-3) and a simple small LM (GPT-2). Feel free to play around with the decoding parameters! (WARNING: the instruction-tuned model is a large model, so it can take a moment to load on Colab. Please also be aware of it if you execute the notebook locally!)
Click below to see the solution
The dataset contains generative fine-tuning examples on a broad range of tasks, from calculations to writing stories to cooking recipes.
The assistant tutor model has to be finetuned such that it can have natural-sounding conversations with students on a range of subjects, i.e., we need dialogues from a teaching context as data for training. For the medical knowledge base, it is probably necessary to implement an integrated retrieval system that allows the language model to access the database, rather than just generating everything from next-token predictions. For writing abstracts of scientific papers, we have to optimise the model on summarisation, possibly already using scientific papers in the training.
You can use something like: “Please write a sentence where every word begins with ‘a’. Answer: ”. In this case, the output was “Alice admired” for the instruction-tuned model (which fulfills the instruction), but “‘A’ is a noun” for GPT-2, which does not.
# import packages
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the instruction-tuned model in 4-bit quantization so that it fits into Colab memory
tokenizer_instruct = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model_instruct = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

# load the small base LM for comparison
tokenizer_lm = AutoTokenizer.from_pretrained("gpt2")
model_lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

instruction_text = "" #### YOUR TEXT HERE ####

input_ids_instruct = tokenizer_instruct.encode(instruction_text, return_tensors="pt").to(device)
input_ids_lm = tokenizer_lm.encode(instruction_text, return_tensors="pt").to(device)

prediction_instruct = model_instruct.generate(
    input_ids_instruct
)
print("Instruction-tuned model's prediction: ", tokenizer_instruct.decode(prediction_instruct[0], skip_special_tokens=True))

prediction_lm = model_lm.generate(
    input_ids_lm
)
print("GPT-2's prediction: ", tokenizer_lm.decode(prediction_lm[0], skip_special_tokens=True))
The distinctions above concerned the content of the fine-tuning, i.e., the content of the input-output demonstrations in the datasets used for supervised fine-tuning.
Additionally, the lecture introduced different methods of efficient supervised fine-tuning, which is especially important for large LMs that take a lot of resources to train. The QA fine-tuning that we did in homework 1 was naive fine-tuning. That is, during the fine-tuning, all parameters were updated. However, as explained in the lecture, the more common state-of-the-art approach to fine-tuning is parameter-efficient, i.e., only a selected subset of the pretrained model parameters, or a small set of new parameters is updated.
The following code provides a simple example of vanilla selective fine-tuning where only the last transformer block and the final layer norm of GPT-2 are fine-tuned, and all other layers are frozen (❄️). Concretely, this means that we neither compute gradients for the frozen parameters nor change their values. Usually, parameters are (un)frozen by layer / component.
Of course, the same approach can be used for (un)freezing any other subset of layers in any other transformers model. For this, it is useful to know how to inspect and access different components of a pretrained model, as was briefly shown in sheet 3.1.
Optionally, you can reuse the code from the homework to fine-tune this partially frozen model on the QA task from the homework. Do your results change?
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

# first, we can inspect the model's architecture and its named parameters
print(gpt2_model)
for name, _ in gpt2_model.named_parameters():
    print(name)

# next, we define which layers NOT to freeze
# (of course, we can do this vice versa and define which layers to freeze)
layers_to_unfreeze = ["transformer.h.11", "transformer.ln_f.weight", "transformer.ln_f.bias"]

# iterate over the model's parameters
# note that, by default, in train mode, all parameters are set to requires_grad = True (i.e., unfrozen)
for name, param in gpt2_model.named_parameters():
    # check whether these parameters are in the layers_to_unfreeze list
    if all([not name.startswith(n) for n in layers_to_unfreeze]):
        # if not, freeze these parameters
        param.requires_grad = False

# now we check how many parameters are trainable
params = [p for p in gpt2_model.parameters() if p.requires_grad]
print(f'The model has {sum(p.numel() for p in params):,} trainable parameters')
Exercise 4.1.2: PEFT
Compare the number above with the number of trainable parameters in the original model. What changed? How do you expect this to affect fine-tuning results?
Suppose that we wanted to use rank \(r=4\) LoRA for fine-tuning the decoder self-attention block of GPT-2. How many parameters would the lower rank matrices A and B have (see slides 27-29 for reference)?
Click below to see the solution
Number of trainable parameters before freezing: 124 million, after freezing: 7 million. Although intuitively, one might think that the performance will be worse when fewer parameters are fine-tuned, PEFT has turned out to be efficient and even have further advantages over fine-tuning all parameters, e.g. with regard to catastrophic forgetting.
The (combined query-key-value) attention weight matrix of GPT-2 has a size of 768×2304, so the low-rank matrices have the dimensions 768×4 and 4×2304, i.e., 768·4 + 4·2304 = 12,288 trainable parameters per attention block (compared to 768·2304 ≈ 1.77M in the original matrix).
Outlook: PEFT in practice#
Below, more optional resources for the other types of fine-tuning introduced in the lecture (LoRA, QLoRA) can be found.
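As a minimal practical sketch of LoRA (assuming the peft library is installed; exact configuration options may differ slightly across peft versions), rank \(r=4\) adapters can be added to GPT-2's combined query-key-value projections as follows:

# !pip install peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# rank-4 LoRA adapters on the combined QKV projection ("c_attn") of each attention block
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base_model, lora_config)
# prints the number of trainable (adapter) parameters vs. all parameters
lora_model.print_trainable_parameters()

With rank 4 on c_attn, this should report roughly 12 × 12,288 ≈ 147K trainable adapter parameters (12 blocks times the A and B matrices computed in the exercise above), compared to the 124M parameters of the full model.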
RL fine-tuning#
Reinforcement learning is often introduced as a separate type of machine learning, in addition to supervised and unsupervised learning. It broadly defines the field of study and the methods for training agents to take actions that (optimally) achieve a goal, based on experience in an environment. It can be seen as the computational formalization of trial-and-error learning.
The key difference to supervised learning is that the agent (the terms “model”, “LM” and “agent” will be used interchangeably in this section) learns by itself which actions are useful for achieving the goal, rather than being shown the “ground truth” optimal actions as in supervised learning. The task of the developer is, therefore, to correctly specify the goal, and the agent will “discover” a way to achieve it. In the formal framework that underpins RL (namely, Markov decision processes, MDPs), the goal is represented via the reward function. This function assigns high rewards to desired outcomes and low rewards to undesired ones, thereby implicitly representing the goal. It is important to note that, in general, RL allows the developer to specify what they want the agent to learn to do (i.e., the goal), but not necessarily how, exactly. Correctly specifying the goal is far from trivial, and a lot of current research in the field of alignment goes into understanding how to specify such goals (more on this in future sessions).
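As a toy illustration (entirely made up for this sheet; real RLHF reward signals are learned models, not hand-written rules), a reward function encoding the goal “answer with a short bullet list” could look like this:

# a toy, hand-written reward function that implicitly encodes a goal:
# "the output should be a short bullet list"
def toy_reward(output: str) -> float:
    lines = [l for l in output.strip().split("\n") if l.strip()]
    is_bullet_list = len(lines) > 0 and all(l.lstrip().startswith("-") for l in lines)
    is_short = len(lines) <= 5
    # high reward for desired outcomes, low reward for undesired ones
    return 1.0 if (is_bullet_list and is_short) else -1.0

print(toy_reward("- tofu stir-fry\n- miso soup"))      # 1.0
print(toy_reward("Here is a long paragraph instead.")) # -1.0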
Using RL for fine-tuning LLMs is one of the main methodological innovations that seems to have led to the impressive performance of SOTA LLMs. In particular, RL allows fine-tuning LLMs towards human preferences and commercial usability, because its mechanics lend themselves to training a model based on a more abstract signal of whether an output is good or bad (i.e., letting it discover how to generate output that receives a “good” reward!), rather than based on particular demonstrations. This is especially useful because the objectives of fine-tuning SOTA LLMs often include aspects that are very difficult to specify via specific demonstrations, like being helpful, honest, and harmless (Bai et al., 2022). To this end, instead of using supervised learning, human feedback can be used as a reward signal to fine-tune the model. This is why the fine-tuning technique in question is called Reinforcement Learning from Human Feedback (RLHF).
Below, practical aspects of the core components of RLHF are discussed. These core components are (see this figure for an overview):
the policy (i.e., the backbone LLM),
the supervised fine-tuning data (SFT) and training,
the reward modeling data and the resulting reward model,
and, finally, the RL training objective (commonly, the PPO algorithm) and the dataset for fine-tuning.
As with standard LM training, there are packages which implement some of the required training steps for us. We will look at the transformers-based package trl.
Policy#
As the policy, a pretrained, sufficiently large LM is usually chosen. For instance, Llama offers a suite of models where both the initial LM (i.e., the base model) and the resulting fine-tuned model are provided. For Llama-2 (Touvron et al. (2023)), the paper provides some information about the details of the training and the architecture of the models.
Among state-of-the-art models, LMs of at least 1B parameters are usually used as policies for further fine-tuning. The intuitive reason is that RLHF is mostly used for creating general-purpose assistants rather than task-specific models; therefore, the initial model should have rather large capacity and perform well on a wide range of tasks (which, as we know, tends to come with scale). In other words, the base model should be good – otherwise: “garbage in, garbage out” (i.e., it might be very difficult, if not impossible, to fix a bad base model through fine-tuning).
Of course, it is absolutely possible to use RLHF for task-specific fine-tuning, e.g., early work fine-tuned a model for summarization.
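In trl, the policy is a standard causal LM that is additionally wrapped with a value head needed for PPO training later on. A minimal sketch (the wrapper class below is provided by trl, though details of the API may vary across versions):

from trl import AutoModelForCausalLMWithValueHead

# the policy: a (pretrained) causal LM plus a scalar value head used by PPO;
# here a small model is used for illustration -- in practice, much larger base models are chosen
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
print(policy_model.v_head)  # the additional value head on top of the LM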
Supervised fine-tuning#
This step is a “standard” supervised fine-tuning (SFT) step that is performed as we have discussed above. Some RL-tuned models are closed source, so it is unknown whether any of the PEFT techniques is used; others are open-source and might report their fine-tuning approach.
While the specific methodological details might vary, the conceptual point behind SFT is two-fold: (1) models are often instruction-tuned to turn them into actually useful assistants, and (2) (more importantly) the model is “nudged” towards outputting human-demonstrated (and therefore, human-preferred) texts for the ultimate goal of the fine-tuning (e.g., being helpful, harmless, honest). This makes subsequent RL-tuning more efficient. Intuitively, this is because the space of actions (i.e., any possible completion, given a prompt!) which the agent has to explore is gigantic, and the agent might make quite a lot of errors before “stumbling” upon high-reward actions. Through SFT, the agent is already biased towards the higher-reward part of the action space. Therefore, the SFT dataset usually consists of examples of target outputs written by human annotators. This step is also often called behavioral cloning.
Exercise 4.1.3: Supervised fine-tuning for RL
What would an SFT dataset look like for fine-tuning a model for summarization?
What aspects do you think are important to keep in mind when performing SFT? (Think: what sorts of examples should human annotators see? What should they be instructed to do? Feel free to also use information from the OpenAI blogpost for inspiration)
Below is an example of using the trl library for the SFT step. Please go through the code and make sure you understand it. Look at the docs for the training arguments class which is used by default by the SFTTrainer here. By default, for how many epochs is the model trained?
Click below to see the solution
For summarization, a dataset for supervised fine-tuning would have to include texts and good summaries of them.
Aspects that are to be kept in mind are, for example, truthfulness of answers, length of answers, potentially harmful content (although filtering this out is one of the tasks that RL will help with), and answers should generally make sense (since nonsensical answers will alienate annotators).
By default, the model is trained for 3 epochs.
# uncomment and run on Colab / install the package in your environment if you haven't yet
# !pip install trl

from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

# load dataset
# you can inspect it to get a sense of its contents and formatting
dataset = load_dataset("CarperAI/openai_summarize_tldr", split="valid")

# load base model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# define a function that formats the prompts
# NOTE: the formatting of examples is extremely important for successful training and deployment for a given task
# for instance, the use of special tokens and prompt formatting should be consistent with the task, the model (if it already uses special tokens)
# and should be used in consistent ways in further training
def formatting_prompts_func(example):
    output_texts = []
    # iterate over batch
    for i in range(len(example['prompt'])):
        # concatenate the input (prompt) and the target label (summary)
        text = f"{example['prompt'][i]}{example['label'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=256,
    dataset_batch_size=8,
)
trainer.train()

# If one wants to pass custom training arguments, an example of doing so is here: https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py
# NOTE: if you experience errors running this, setting the version of accelerate to pip install accelerate==0.27.2 might help
Reward modeling#
A core component of RL fine-tuning is the reward model. There are various ways of obtaining a reward model. Perhaps the most intuitive one, as discussed in the lecture, is using human feedback as the reward signal. Since having humans give online feedback to an LM for thousands of samples is costly and cumbersome, a reward model is instead trained on human preferences. For training such a reward model, human annotations are collected. As discussed in the lecture, these annotations can take different forms.
Below, we will look at an example based on simple binary comparisons, where human annotators were shown two outputs of the LM sampled for the same input \(x\) and had to indicate which of them they preferred (e.g., output \(y_1\) over \(y_2\)). Specifically, the reward model is usually initialized from a pretrained LLM (maybe even the same one as the base LM of the policy), and fine-tuned with a specific head to output numerical scores. It is usually trained to maximize the difference in predicted scores \(r\) between the preferred and the rejected answer in a pair. Formally, writing \(r_\theta(x, y)\) for the score the reward model assigns to output \(y\) given input \(x\), \(y_1\) for the preferred and \(y_2\) for the rejected output, and \(\sigma\) for the sigmoid function, the model is trained with the following loss:
\[\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_1, y_2) \sim D}\left[\log \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big)\right]\]
This form of human annotation has proven especially useful for eliciting human intuitions about difficult-to-capture concepts like “helpfulness”. It is much easier for humans to decide which of two options they prefer than, e.g., to consistently assign scalar 1-10 scores to outputs.
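To make the loss concrete, here is a minimal PyTorch sketch of the pairwise objective, assuming we already have reward-model scores for the preferred and the rejected outputs of a small batch (the numbers are made up):

import torch
import torch.nn.functional as F

# hypothetical reward-model scores for a batch of three preference pairs
rewards_preferred = torch.tensor([1.2, 0.3, -0.5])  # r(x, y_1) for the preferred outputs
rewards_rejected = torch.tensor([0.4, -0.1, -0.2])  # r(x, y_2) for the rejected outputs

# pairwise preference loss: -log sigmoid(r(x, y_1) - r(x, y_2)), averaged over the batch
loss = -F.logsigmoid(rewards_preferred - rewards_rejected).mean()
# minimizing this loss pushes the preferred scores above the rejected ones
print(loss)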
Exercise 4.1.4: Reward modeling
Consider this well-known dataset from Anthropic for training helpful and harmless assistants. Pick a specific sample. For this sample, which text corresponds to \(x, y_1, y_2\)?
An example of a model trained with the approach above can be found here. Feel free to explore the repository, if you want. We will not test this model since it is quite large.
Click below to see the solution
\(x\) is the shared input (the dialogue context ending with the last human turn), and \(y_1, y_2\) are the two alternative assistant responses, i.e., the “chosen” and “rejected” continuations in the dataset (their specific contents depend on the chosen sample).
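If you want to inspect the data programmatically, here is a small sketch (assuming the dataset's “chosen” and “rejected” columns, each of which contains a full dialogue):

from datasets import load_dataset

# load a small slice of the Anthropic helpfulness/harmlessness preference data
hh_dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5]")

sample = hh_dataset[0]
# both fields share the same dialogue prefix (the input x);
# they differ in the final assistant response, i.e., y_1 (chosen) vs. y_2 (rejected)
print(sample["chosen"])
print(sample["rejected"])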
However, there are also alternative ways of providing rewards. For instance, some work has used other models to provide feedback, an approach that is called [RLAIF](https://arxiv.org/pdf/2212.08073) (RL from AI feedback).
Alternatively, more task-specific reward models can be constructed. For instance, if one wants to use RL-tuning for training a summarization model, a score for evaluating summaries (e.g., ROUGE) can be used as the numerical reward. If one wants to train a model for generating positive-sentiment texts only, one can train a reward model on labelled sentiment classification data, e.g., the IMDB dataset.
Exercise 4.1.5: Task-specific reward modeling
The code below provides an example of loading a trained reward model. This model was trained on movie reviews, in particular to assign high scores to positive reviews and low scores to negative reviews, as described above. Please look at the code and make sure to understand it. Test it on a few of your own intuitive examples; do the scores (ordinally) correspond to your intuition?
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
)
import torch

# Load the reward model and its tokenizer
reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

# Write your own positive and negative example reviews to see how the reward model scores them
positive_sentence = "" #### YOUR EXAMPLE HERE ####
negative_sentence = "" #### YOUR EXAMPLE HERE ####

input_pos = reward_tokenizer(positive_sentence, return_tensors='pt')
input_neg = reward_tokenizer(negative_sentence, return_tensors='pt')

# the model outputs one logit per class (negative / positive)
reward_pos = reward_model(**input_pos).logits
reward_neg = reward_model(**input_neg).logits
print("Reward for positive sentence: ", reward_pos)
print("Reward for negative sentence: ", reward_neg)
PPO training#
Once the SFT model and the reward model are available, the final step is to fine-tune the SFT model with RL. For this, a dataset to fine-tune on is needed again. Depending on the model, sometimes the same dataset as for SFT is used, sometimes a similar dataset with other inputs is used. Note that now the dataset doesn't need any labels, but only the inputs (i.e., initial “states”) based on which the model will generate predictions (i.e., actions) which, in turn, will be assigned rewards (by the reward model, which represents the environment).
There are different algorithms for training the policy. One currently common choice is Proximal Policy Optimization (PPO). Its components were developed in order to stabilize policy updates and speed up convergence. However, there is a core idea shared between PPO and other algorithms in the category of policy-gradient methods. Specifically, the theorem behind this class of methods shows that the policy can be trained with a loss based on “trial and error” (i.e., based on sampling); this objective has been shown to update the policy in such a way that, by following it, the agent is expected to receive higher rewards. Specifically, for a training step, we can sample actions, retrieve their log probability under the current policy, get rewards for these actions, and calculate weight updates based on the product of log probability and reward.
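The following is a heavily simplified, REINFORCE-style sketch of such a training step for a single prompt (PPO adds, among other things, clipping, a value baseline, and a KL penalty on top of this core idea; the reward here is a placeholder scalar instead of a reward-model output):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

# the input (initial "state")
inputs = tokenizer(["The movie was"], return_tensors="pt")

# 1. sample an action (a completion) from the current policy
completion = policy.generate(**inputs, do_sample=True, max_new_tokens=10)

# 2. retrieve the log probabilities of the sampled tokens under the current policy
logits = policy(completion).logits[:, :-1, :]
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, completion[:, 1:].unsqueeze(-1)).squeeze(-1)
# only the newly generated tokens count as the action, not the prompt
action_log_prob = token_log_probs[:, inputs["input_ids"].shape[1] - 1:].sum(dim=-1)

# 3. get a reward for this action (placeholder; in RLHF this comes from the reward model)
reward = torch.tensor([1.0])

# 4. update the weights based on the product of log probability and reward
loss = -(action_log_prob * reward).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()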
Note that choosing hyperparameters for successful RL fine-tuning is very important. While there are no proven recipes, community best practice seems to indicate that the batch size should be relatively large and the training quite short (e.g., only one epoch) to avoid undesired effects. Some of these details can be found, e.g., in the Llama-2 report.
Exercise 4.1.6: RL training
Suppose you train an LM for summarization and have a suitable reward model and dataset of input texts. Consider the last sentence of the explanation above. Please describe the specific details of such a training step for this summarization example (in words).
An example of using the trl library for training a model to generate positive movie reviews (using the reward model above!) with PPO can be found here. Please look at the code and try to understand all of it! Please ask questions or do some research if anything is unclear; this approach might be relevant for your own future exercise ;)
Click below to see the solution
For reference about typical hyperparameter settings, this is what the Llama-2 report says about RLHF training details: “Training Details. We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting. We use the same optimizer parameters as for the base model. The maximum learning rate is \(5 \times 10^{-6}\) for the 70B parameter Llama 2-Chat and \(1 \times 10^{-5}\) for the rest. The learning rate is decreased on a cosine learning rate schedule, down to 10% of the maximum learning rate. We use a warm-up of 3% of the total number of steps, with a minimum of 5. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.”
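For orientation, the core loop of the linked example roughly follows the pattern sketched below, shown here with trl's (older) PPOTrainer interface; the trl API has changed considerably across versions, so treat this as an illustration under that assumption and consult the linked script and the docs for the version you have installed. Rewards here are placeholder scalars rather than outputs of the sentiment reward model.

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

# policy with value head; a frozen reference copy is created internally for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=2, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

# inputs (initial "states"): prompts to be continued as positive movie reviews
query_texts = ["The movie was", "I watched this film and"]
query_tensors = [tokenizer(q, return_tensors="pt").input_ids.squeeze(0) for q in query_texts]

# sample actions (completions) from the current policy
response_tensors = [
    ppo_trainer.generate(q, max_new_tokens=16, do_sample=True, return_prompt=False).squeeze(0)
    for q in query_tensors
]

# rewards for the sampled completions (placeholders; normally from the sentiment reward model)
rewards = [torch.tensor(1.0), torch.tensor(-0.5)]

# one PPO optimization step on this batch
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)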
Optional outlook#
RL fine-tuning is an active area of research, so there are many developments and new methods. Below are some optional resources if you want to know more.