Homework 1: Language models (50 points)#

The first homework focuses on the following skills: working through simple formal exercises on language modeling, understanding and extracting properties and configurations of state-of-the-art language models, and, finally, training language models yourself!

Logistics#

  • submission deadline: May 15th 23:59 German time via Moodle

    • please upload a SINGLE ZIP FILE named Surname_FirstName_HW1.zip containing the .ipynb file of the notebook (if you solve it on Colab, you can go to File > download), the json file for Ex. 2 and a .png or .jpg file with your losses plot from Ex. 3.

  • please solve and submit the homework individually!

  • if you use Colab, you can speed up the execution of the code (especially for Exercise 3) by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.

Exercise 1: Understanding language modeling (12 points)#

Please answer the following exercises. Importantly, please reason step by step; i.e., where calculations are required, please provide intermediate steps of how you arrived at your solution. You do not need to write any code, just mathematical solutions.

  1. [6pts] Consider the corpus \(C\) with the following sentences: \(C=\){“The cat sleeps”, “The mouse sings”, “The cat sleeps”, “A dog sings”}. (a) Define the vocabulary \(V\) of this corpus (assuming by-word tokenization). (b) Pick one of the four sentences in \(C\). Formulate the probability of that sentence in the form of the chain rule. Calculate the probability of each term in the chain rule, given the corpus. (See the reminder of the relevant definitions after this list.)

  2. [4pts] We want to train a neural network that takes as input two numbers \(x_1, x_2\), passes them through three hidden linear layers, each with 13 neurons, each followed by the ReLU activation function, and outputs three numbers \(y_1, y_2, y_3\). Write down all weight matrices of this network with their dimensions. (Example: if one weight matrix has the dimensions 3x5, write \(M_1\in R^{3\times5}\))

  3. [2pts] Consider the sequence: “Input: Some students trained each language model”. Assuming that each word+space/punctuation corresponds to one token, consider the following token probabilities of this sequence under some trained language model: \(p = [0.67, 0.91, 0.83, 0.40, 0.29, 0.58, 0.75]\). Compute the average surprisal of this sequence under that language model. [Note: in this class we always assume the base \(e\) for \(\log\), unless indicated otherwise. This is also usually the case throughout NLP.]
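A generic reminder (not a solution) of the definitions needed in items 1 and 3: for a sequence of tokens \(w_1, \dots, w_n\), the chain rule factorizes the sequence probability as \(P(w_1 \dots w_n) = P(w_1)\prod_{i=2}^{n} P(w_i \mid w_1, \dots, w_{i-1})\), and the average surprisal is the mean negative log probability of the tokens, \(\bar{s} = -\frac{1}{n}\sum_{i=1}^{n} \log P(w_i \mid w_{<i})\).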

Exercise 2: Extracting LLM fingerprints (15 points)#

For this task, your job is to extract the “fingerprint” of a state-of-the-art large language model from its paper. Specifically, your job is to:

  • find the model that is assigned to your surname in the list HW1_Model2Group_assignment.csv (to be found on Moodle under topic 02). Please investigate the latest version of your model, unless the version is specified in the list.

  • find out the following characteristics of your model (see the json fields below)

  • submit a json file with your responses in the following format (below is a partial example).

Note that, of course, some information might not be available or some categories might not be applicable. The idea is that, as a course, we can create a fun website which will show a somewhat comprehensive graphical comparison of current language models and their configurations. Based on your collective json files, the lecturers will set up a front end at some point during the class.

IMPORTANT: Please email the lecturers by the homework deadline if you DO NOT consent that your json file is used for this idea.

{
    "model_name": "GPT-35",
    "huggingface_model_id": "gpt35",
    "paper_url": "https://arxiv.org/abs/XXX",
    "tokenizer_type": "BPE",
    "vocabulary_size": "XXX",
    "architecture": "Mixture of transformer agents",
    "architecture_type": "decoder only",
    "architecture_quirks": [
        "sparse attention", 
        "...",
    ],
    "parameters": "XXX",
    "finetuning_type": "RLHF",
    "training_data_cutoff": "2050",
    "number_training_tokens": "XXX",
    "pretraining_data_size": "1GB",
    "finetuning_data_size": "XXX",
    "training_data": [
        "Books corpus",
        "Twitter",
        "..."
    ],
    "finetuning_data": [
        "XXX",
        "XXX",
        "..."
    ],
    "access": "open",
    "summary": "A few sentences of what the model claims to be their unique selling point / main contribution"
}
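Tip: before submitting, it can be worth sanity-checking that your file parses as valid JSON. A minimal sketch (the filename Surname_FirstName_HW1.json is just an assumption; use whatever you named your file):

import json

# attempt to parse the submission file; raises json.JSONDecodeError if the file is malformed
with open("Surname_FirstName_HW1.json") as f:
    fingerprint = json.load(f)
print(fingerprint["model_name"])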

Exercise 3: Fine-tuning GPT-2 for QA (23 points)#

The learning goal of this exercise is to practice fine-tuning a pretrained LM, GPT-2 small, for a particular task, namely commonsense question answering (QA). We will use a task-specific dataset, CommonsenseQA, introduced by Talmor et al. (2018). We will evaluate the model on our test split over the course of training, to monitor whether its performance is improving, and we will compare the base pretrained GPT-2 against the fine-tuned model. We will need to perform the following steps:

  1. Prepare data according to steps described in sheet 1.1

    1. in addition to these steps, prepare a custom Dataset (like in sheet 2.3) that massages the dataset from the format in which it ships on HuggingFace into strings that can be used for training. Some of the processing steps will happen in the Dataset.

  2. Load the pretrained GPT-2 model

  3. Set up training pipeline according to steps described in sheet 2.5

  4. Run the training while tracking the losses

  5. Save plot of losses for submission

Your tasks:

  1. [19pts] Complete the code in the spots where there is a comment “#### YOUR CODE HERE ####”. There are instructions in the comments as to what the code should implement. With your completed code, you should be able to let the training run without errors. Note that the point of the exercise is the implementation; we do not necessarily expect great performance from the fine-tuned model (and the actual performance will not be graded). Often there are several correct ways of implementing something; anything that is correct will be accepted.

  2. [4pts] Answer the questions at the end of the exercise.

# note: if you are on Colab, you might need to install some requirements
# as we did in Sheet 1.1. Otherwise, don't forget to activate your local environment

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2Tokenizer, GPT2LMHeadModel
import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
# additionally, we need to install accelerate
# uncomment and run the following line on Colab or in your environment
# !pip install accelerate
# NOTE: in a notebook, reloading of the kernel might be required after installation if you get dependency errors with the transformers package
### 1. Prepare data with data prepping steps from sheet 1.1

# a. Acquiring data
# b. (minimally) exploring dataset
# c. cleaning / wrangling data (combines step 4 from sheet 1.1 and step 1.1 above)
# d. splitting data into training and test set (we will not do any hyperparam tuning) 
# (we don't need further training set wrangling)
# e. tokenizing data and making sure it can be batched (i.e., converted into 2d tensors)
# this will also happen in our custom Dataset class (common practice when working with text data)
# download dataset from HF
dataset = load_dataset("tau/commonsense_qa")
# inspect dataset
print(dataset.keys())
# print a sample from the dataset
### YOUR CODE HERE ####
dict_keys(['train', 'validation', 'test'])

Note that the test split does not have ground truth answer labels. Therefore, we will use the validation split as our test split.

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# set padding side to be left because we are doing causal LM
tokenizer.padding_side = "left"
def massage_input_text(example):
    """
    Helper for converting input examples which have 
    a separate question, labels, and answer options
    into a single string.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    input_text: str
        Formatted training text which contains the question,
        the formatted answer options (e.g., 'A. <option 1> B. <option 2>' etc)
        and the ground truth answer.
    """
    # combine each label with its corresponding text
    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space
    answer_options = ### YOUR CODE HERE ####
    # join the list of options with spaces into single string
    answer_options_string = ### YOUR CODE HERE ####
    # combine question and answer options
    input_text = example["question"] + " " + answer_options_string
    # append the true answer with a new line, "Answer: " and the label
    input_text += "\nAnswer: " + example["answerKey"]

    return input_text

# process input texts of train and test sets
massaged_datasets = dataset.map(
    lambda example: {
        "text": massage_input_text(example)
    }
)
# inspect a sample from our preprocessed data
massaged_datasets["train"][0]
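For orientation, a massaged training string for a (made-up) example could look roughly like this, assuming the formatting described in the docstring above:

Where do cats like to sleep? A. sofa B. tree C. mailbox
Answer: A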
class CommonsenseQADataset(Dataset):
    """
    Custom dataset class for CommonsenseQA dataset.
    """

    def __init__(
            self, 
            train_split, 
            test_split,
            tokenizer,
            max_length=64,
            dataset_split="train",
        ) -> None:
        """
        Initialize the dataset object.
        
        Arguments
        ---------
        train_split: dict
            Training data dictionary with different columns.
        test_split: dict
            Test data dictionary with different columns.
        tokenizer: Tokenizer
            Initialized tokenizer for processing samples.
        max_length: int
            Maximal length of inputs. All inputs will be 
            truncated or padded to this length.
        dataset_split: str
            Specifies which split of the dataset to use. 
            Default is "train".
        """
        self.train_split = train_split['text']
        self.test_split = test_split['text']
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.dataset_split = dataset_split

    def __len__(self):
        """
        Method returning the length of the training dataset.
        """
        return ### YOUR CODE HERE ####
    
    def __getitem__(self, idx):
        """
        Method returning a single training example.
        Note that it also tokenizes, truncates or pads the input text.
        Further, it creates a mask tensor for the input text which 
        is used for causal masking in the transformer model.

        Arguments
        ---------
        idx: int
            Index of training sample to be retrieved from the data.
        
        Returns
        --------
        tokenized_input: dict
            Dictionary with input_ids (torch.Tensor) and an attention_mask
            (torch.Tensor).
        """
        # retrieve a training sample at the specified index idx
        # HINT: note that this might depend on self.dataset_split
        input_text = ### YOUR CODE HERE ####
        tokenized_input = self.tokenizer(
            input_text,
            max_length=### YOUR CODE HERE ####
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
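        # note: since pad_token == eos_token, the comparison below will also mask genuine EOS tokens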
        tokenized_input["attention_mask"] = (tokenized_input["input_ids"] != tokenizer.pad_token_id).long()
        return tokenized_input
# move to accelerated device 
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")
# 2. init model

# load pretrained gpt2 from HF
model = ### YOUR CODE HERE ####
# print the total number of parameters
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

Hint: If you run out of memory while trying to run the training, try decreasing the batch size.

# 3. set up configurations required for the training loop

# instantiate dataset with the downloaded commonsense_qa data 
train_dataset = CommonsenseQADataset(
    ### YOUR CODE HERE ####
)
# instantiate test dataset with the downloaded commonsense_qa data
test_dataset = CommonsenseQADataset(
    ### YOUR CODE HERE ####,
    dataset_split="test"
)
# create a DataLoader for the dataset
# the data loader will automatically batch the data
# and iteratively return training examples (question answer pairs) in batches
dataloader = DataLoader(
    train_dataset, 
    batch_size=32, 
    shuffle=True
)
# create a DataLoader for the test dataset
# reason for separate data loader is that we want to
# be able to use a different index for retrieving the test batches
# we might also want to use a different batch size etc.
test_dataloader = DataLoader(
    test_dataset, 
    batch_size=32, 
    shuffle=True
)
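Before training, it can help to sanity-check one batch from the loader (runnable once the placeholders above are completed); note that the exact tensor shape depends on how your Dataset returns its tensors:

# peek at a single batch to verify that batching works
batch = next(iter(dataloader))
print(batch["input_ids"].shape)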
# 4. run the training of the model
# Hint: for implementing the forward pass and loss computation, carefully look at the exercise sheets 
# and the links to examples in HF tutorials.

# put the model in training mode
model.train()
# move the model to the device (e.g. GPU)
model = model.to(device)

# training configurations
# feel free to play around with these
epochs  = 1
train_steps =  len(train_dataset) // 32
print("Number of training steps: ", train_steps)
# number of test steps to perform every 10 training steps
# (smaller than the entire test split for reasons of computation time)
num_test_steps = 5

# define optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4) 
# define some variables to accumulate the losses
losses = []
test_losses = []

# iterate over epochs
for e in range(epochs):
    # iterate over training steps
    for i in tqdm(range(train_steps)):
        # get a batch of data
        x = next(iter(dataloader))
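        # note: re-creating the iterator every step draws a freshly shuffled batch,
        # i.e., batches are sampled with replacement rather than sweeping the epoch once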
        # move the data to the device (GPU)
        x = ### YOUR CODE HERE ####

        # forward pass through the model
        ### YOUR CODE HERE ###
        outputs = model(
            ### YOUR CODE HERE ####
        )
        # get the loss
        loss = ### YOUR CODE HERE ####
        # backward pass
        ### YOUR CODE HERE ####
        losses.append(loss.item())
        # update the parameters of the model
        ### YOUR CODE HERE ###

        # zero out gradient for next step
        ### YOUR CODE HERE ####

        # evaluate on test set every 10 steps
        if i % 10 == 0:
            print(f"Epoch {e}, step {i}, loss {loss.item()}")
            # track test loss for the evaluation iteration
            test_loss = 0
            for j in range(num_test_steps):
                # get test batch
                x_test = next(iter(test_dataloader))
                x_test = x_test.to(device)
                with torch.no_grad():
                    test_outputs = model(
                        ### YOUR CODE HERE ####
                    )
                test_loss += ### YOUR CODE HERE ####
                
            test_losses.append(test_loss / num_test_steps)
            print("Test loss: ", test_loss/num_test_steps)
# 5. Plot the fine-tuning loss and MAKE SURE TO SAVE IT AND SUBMIT IT

# plot training losses against training steps
plt.plot(### YOUR CODE HERE ####)
plt.xlabel("Training steps")
plt.ylabel("Loss")
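To save the figure for submission, one option is matplotlib's savefig (the filename is up to you):

plt.savefig("losses.png")  # writes the plot to disk; include this file in your submission zip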
# print a few predictions on the eval dataset to see what the model predicts

# construct a list of questions without the ground truth label
# and compare prediction of the model with the ground truth

def construct_test_samples(example):
    """
    Helper for converting input examples which have 
    a separate question, labels, and answer options
    into a single string for testing the model.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    input_text: str, str
        Tuple: Formatted test text which contains the question,
        the formatted answer options (e.g., 'A. <option 1> B. <option 2>' etc); 
        the ground truth answer label only.
    """

    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space
    answer_options = ### YOUR CODE HERE ####
    # join the list of options with spaces into single string
    answer_options_string = ### YOUR CODE HERE ####
    # combine question and answer options
    input_text = example["question"] + " " + answer_options_string
    # create the test input text which should be:
    # the input text, followed by the string "Answer: "
    # we don't need to append the ground truth answer since we are creating test inputs
    # and the answer should be predicted.
    input_text += ### YOUR CODE HERE ####

    return input_text, example["answerKey"]

test_samples = [construct_test_samples(dataset["validation"][i]) for i in range(10)]
test_samples
# Test the model 

# set it to evaluation mode
model.eval()

predictions = []
for sample in test_samples:
    input_text = sample[0]
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=2,
        do_sample=True,
        temperature=0.4,
    )
    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append((input_text, prediction, sample[1]))

print("Predictions of trained model ", predictions)

Questions:

  1. Provide a brief description of the CommonsenseQA dataset. What kind of task was it developed for, and what do the individual columns contain?

  2. What loss function is computed for this training? Provide the name of the function (conceptual, not necessarily the name of a function in the code).

  3. Given your loss curve, do you think your model will perform well on answering common sense questions? (Note: there is no single right answer; you need to interpret your specific plot)

  4. Inspect the predictions above. On how many test questions did the model predict the right answer? Compute the accuracy.