Sheet 7.1: Behavioral assessment & Evaluation#

Author: Polina Tsvilodub

This sheet focuses on evaluating the input-output (I/O) behavior of LLMs. Inspired by experimental paradigms and terminology from cognitive science and psychology, which investigate a blackbox (the human mind) by looking at its behavior across different interesting conditions (inputs), such assessment of LLMs (also blackboxes) can be called “behavioral assessment”. This approach can be seen as one piece that should work in combination with the attribution methods discussed in the previous sheet in order to provide a fuller understanding of what LLMs can or cannot do (I/O testing) and how they do it (attributions). Following the structure of the lecture, we will first look at practical aspects of benchmark testing, and then at “machine psychology”, which often draws on the same methods but addresses somewhat different research questions.

Therefore, the learning goals of this sheet are:

  • look at examples of a few different benchmarks and how they are usually constructed

  • become familiar with standard evaluation metrics and methods used for evaluating LLMs on benchmarks (these include perplexity (PPL), log-probability-based scores, accuracy, F1, evaluation of free generation, etc.)

  • look at examples of machine psychology and how, in practice, LLM performance can be easily compared to human data.

Benchmark testing#

Such I/O evaluations are the most common approach to LLM evaluation. From a more technical / engineering-oriented perspective, which aims at building LLMs for specific applications, it is very common to make use of large benchmark datasets designed to test models’ performance on a variety of tasks in an automated way. This is often done by checking the models’ outputs against ground truth answers or by computing standard scores for certain datasets. The quality of LLMs is therefore measured by their scores on these benchmarks.

Initially, these benchmarks were designed to test LLMs’ linguistic performance, since the goal of building the model is a system that predicts grammatical and fluent natural language. Therefore, some of the first benchmarks (or, commonly used textual datasets) are, for instance, Wikipedia texts, the Penn Treebank, and the GLUE benchmark. Wikipedia texts are often used for measuring the perplexity of the model on this standard text (see below for details). The Penn Treebank was often used for fine-tuning or evaluating models, e.g., on part-of-speech tagging as an approximation of syntactic performance, while the GLUE benchmark contains tasks which are supposed to approximate (semantic) natural language understanding in the form of paraphrase tasks, sentiment classification, natural language inference tasks, etc.

More recent LLMs have shown perhaps unexpectedly impressive generalization to tasks which seem to require more than linguistic fluency, like solving math and reasoning problems. Therefore, more recent benchmarks incorporate tests of various tasks going beyond linguistic capability. Two of the most widely used benchmarks are the MMLU and BIG-Bench datasets. Given that SOTA LLMs are also often designed as assistants and embedded in user-facing applications, it has also become crucial to evaluate the potential social impact that LLMs might have with their outputs, e.g., by assessing biases and the toxicity of the generations. To this end, specialized benchmarks like RealToxicityPrompts or WinoGender were created.

One crucial assumption behind benchmark evaluation is that benchmarks are representative of the tasks, and cover a wide variety of the data, that the model should perform well on in order to count as a good model for its target deployment. Although benchmarks arguably provide wide coverage (they commonly contain thousands of inputs and answers), they often test only an approximation of what the model does in deployment (i.e., free text generation). Furthermore, with newer models trained on newer crawls of the internet, there are increasing worries about so-called contamination, i.e., the test datasets actually being included in the training data of the models, thereby potentially inflating benchmark scores relative to the models’ true generalization ability. For instance, Wikipedia is included in the training data of most modern models.

Scalably evaluating longer generated texts is quite a difficult task. This is because, intuitively, there is no single “ground truth answer” when it comes to writing; there are many equally good ways of writing a summary of a text, and even potentially multiple ways of translating a sentence. This makes generated text difficult to evaluate automatically. This is still a largely unsolved issue (!), so that human or machine evaluation is often used. The available methods for automated text scoring are rooted in work on summarization and machine translation, and require (human-written) gold-standard reference texts.

Note that when mentioning a model in the explanations, we refer to trained models which are evaluated with respect to their performance, i.e., models run in inference mode. If one wanted to track performance on certain benchmarks during training, one could also run evaluations on intermediate model checkpoints. Just note that the model is “frozen” and runs in inference mode during all of the testing described in this sheet.

In sum, benchmarks are so widely used because they offer a few core advantages:

  • the availability of a few well-known datasets leads to (somewhat of a) standardization of the evaluation procedure across different work.

  • their large scale often provides high coverage and more reliable results (although coverage might not always mean consistent quality or the variability expected, e.g., by linguists).

  • crucially: they are designed to be evaluated with easy-to-compute automatic evaluation metrics. You have heard about them in the lecture; we will recap these below and then work with them in practice.

Metrics#

Perplexity: the exponentiated average negative log likelihood a (causal) LM assigns to a text, i.e., a measure of how “surprised” the model is by that text. It is computed as:

$$PPL_{LM}(x_1 \dots x_n) = \exp\left(-\frac{1}{n}\sum_{i=1}^n \log P_{LM}(x_i \mid x_{<i})\right)$$

Note that this is only applicable to causal language models. This is the metric commonly used, e.g., on the Wikipedia texts. For instance, the PPL of GPT-2 on the Penn Treebank dataset is 35.76, while the perplexity of GPT-3 on the same dataset is 20.50. The idea is that an ideal model should have a perplexity as low as possible (the theoretical minimum is 1) for a naturally occurring text, thereby approximating a good fit to the “ground truth distribution of natural language”.

Below is some code for computing the perplexity of different sizes of GPT-2 for an excerpt from Wikipedia.

Exercise 7.1.1: Calculating perplexity

  1. Please complete the code below. (Hint: only one simple transformation is required in order to calculate the perplexity from the NLL loss)

  2. Compare the results for the models of different sizes. Does their comparison (ordering) match your intuition?

import torch
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device('cpu')
# perplexity evaluation
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
test = load_dataset("wikitext", 'wikitext-2-raw-v1', split="test")

input_tokens = tokenizer(
    "\n\n".join(test["text"][:10]), 
    return_tensors="pt",
).input_ids.to(device)

# select a part of the text
# input_tokens = input_tokens[:, 10:50]

# load models of different sizes
model_s = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model_xl = AutoModelForCausalLM.from_pretrained("gpt2-xl").to(device)

output_s = model_s(input_tokens, labels = input_tokens)
output_xl = model_xl(input_tokens, labels = input_tokens)
print("Average NLL for wikipedia chunk under small model ", output_s.loss.item())
print("Average NLL for wikipedia chunk under xl model ", output_xl.loss.item())

### your code for computing the perplexity goes here ###
perplexity_s = np.exp( output_s.loss.item())
perplexity_xl = np.exp( output_xl.loss.item())

print(f"PPL of smaller model: {perplexity_s}, PPL of larger model: {perplexity_xl}")

This blogpost provides an interesting outlook on how to deal with the fixed context-window length of LMs when trying to compute the perplexity of longer texts (e.g., Wikipedia).
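
Below is a rough sketch of such a strided, sliding-window perplexity computation (not necessarily the blogpost’s exact recipe); it reuses model_s and input_tokens from the cell above and is only approximate, since the internal label shift means each window scores one token fewer than trg_len.

# a rough sketch of strided sliding-window perplexity for long texts,
# reusing `model_s` and `input_tokens` from the cell above
max_length = model_s.config.n_positions  # context window of GPT-2 (1024 tokens)
stride = 512                             # overlap, so later tokens keep some context
seq_len = input_tokens.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # tokens newly scored in this window
    window = input_tokens[:, begin:end]
    targets = window.clone()
    targets[:, :-trg_len] = -100             # ignore tokens already scored earlier
    with torch.no_grad():
        out = model_s(window, labels=targets)
    # loss is the average NLL over the (approximately trg_len) scored tokens
    nlls.append(out.loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

ppl_strided = torch.exp(torch.stack(nlls).sum() / seq_len)
print("Strided PPL of the small model: ", ppl_strided.item())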

Accuracy: this is a standard metric widely used in many domains, not only NLP. It computes the proportion of correct responses in a set of tasks and presupposes that there is a single correct answer for a given input. We have seen in the lecture that one way to compute accuracy is to score each answer option, given the input, under the LLM, and retrieve the predicted option via argmax; i.e., take the option to which the model assigned the highest (log) probability to be the chosen option. If this option is the ground truth option, the model’s prediction is correct for this test item (i.e., correctness = 1); otherwise, correctness = 0. Accuracy is then the average correctness across all the test items in the benchmark. The lecture pointed out limitations of the argmax approach. Just as a recap, the underlying assumption is that a model that can perform a task correctly will predict:

$$\log P_{LM}(\text{correct label} \mid \text{context}) > \log P_{LM}(\text{incorrect label} \mid \text{context})$$
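
As a minimal toy illustration of this procedure (with made-up scores, not actual model outputs):

import numpy as np

# toy illustration of argmax-based accuracy with made-up log probabilities
# rows: test items; columns: (average) log probability of each answer option
option_log_probs = np.array([
    [-4.2, -1.3, -3.8],   # item 0: option 1 is most probable
    [-2.7, -2.9, -3.1],   # item 1: option 0 is most probable
    [-5.0, -4.1, -2.2],   # item 2: option 2 is most probable
])
gold_labels = np.array([1, 0, 1])          # ground truth option indices

predicted_labels = option_log_probs.argmax(axis=1)
print("Predictions: ", predicted_labels)                        # -> [1 0 2]
print("Accuracy: ", (predicted_labels == gold_labels).mean())   # -> 0.666...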

The advantage of this approach is that it makes sure to score only the available answer options under the model, which is an especially important constraint for weaker models. However, more powerful SOTA LLMs, especially if they are instruction-tuned, are often also tested via text generation. I.e., the input is given together with an appropriate instruction, and the model’s generated text is evaluated via string matching (e.g., regex or simple matching). If the correct answer option was generated, the model’s correctness is 1 for this trial, and 0 otherwise.
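
For illustration, here is a minimal sketch of such a string-matching check (the helper generation_correct is hypothetical, not part of any evaluation library):

import re

def generation_correct(generated_text, correct_label):
    """Hypothetical helper: check whether the correct answer label
    (e.g., 'B') occurs as a separate word in the generated text."""
    pattern = r"\b" + re.escape(correct_label) + r"\b"
    return int(re.search(pattern, generated_text, flags=re.IGNORECASE) is not None)

print(generation_correct("The answer is B.", "B"))           # 1 (correct)
print(generation_correct("I would go with option C.", "B"))  # 0 (incorrect)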

Below is some code exemplifying the evaluation of a model on the question answering benchmark CommonsenseQA, which we have already used in the homework, via scoring answers under the model. This now provides an automatic implementation of the last task of HW1 / task 2 in HW2. For retrieving conditional log probabilities of different options, given a context, we will be using the package minicons.

Note that here we are interested in scoring the different response options, given the questions, under the model, rather than prompting the model with a list of possible options and letting it generate the option label. Therefore, the wrangling of the dataset is slightly different than in the homework.

Exercise 7.1.2: Calculating accuracy

  1. Please complete the code below.

  2. Compare the results to your results from the homework. Which are better? Do you think the log probability based evaluation is better than the strategy we used in the homework? Why (not)?

  3. What is the expected chance accuracy on this dataset? Why is it important to consider chance accuracy when interpreting the results of a system?

  4. The lecture mentioned effects of various bias corrections that can be applied to the raw scores. In the code below, by default, a length correction is applied (i.e., average log probabilities are used). Use the docs / examples of the minicons package here to retrieve “raw” log probabilities of the completions (i.e., sums over the token log probabilities) and use those to calculate the accuracy. Do the results change?

# load dataset 
dataset = load_dataset("tau/commonsense_qa")
def massage_input_text(example):
    """
    Helper for converting labels, answer options
    into a single string.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    answer_options: list[str]
        Formatted list of answer option strings
        (e.g., ['A. <option 1>', 'B. <option 2>', ...]).
    """
    # combine each label with its corresponding text
    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space
    answer_options = [f"{label}. {text}" for label, text in answer_options_list]

    return answer_options

# process input texts of validation dataset
massaged_dataset_val = dataset["validation"].map(
    lambda example: {
        "text": example["question"],
        "answers": massage_input_text(example),
        # get the index of the correct answer
        "label": example["choices"]["label"].index(example["answerKey"])
    }
)
massaged_dataset_val[0]
# iterate over part of the validation set and compute accuracy 
# (the test set doesn't have ground truth labels)

# set up a scorer 
from minicons import scorer 

lm_scorer = scorer.IncrementalLMScorer(
    'gpt2',
    device=device,
)
# initialize list for storing the correctness of the model predictions
correctness = []

for i in range(100):
    # get the ith example from the validation set
    example = massaged_dataset_val[i]
    # get the text of the question
    question = example['text']
    # get the list of answer options
    answer_options = example['answers']
    # get the ground truth label
    label = example['label']
    
    # pass a list of contexts and a list of continuations to be scored
    answer_scores = lm_scorer.conditional_score(
        # format the question into a list of same length as the number of answer options
        [question] * len(answer_options), 
        answer_options,
    ) 
    # get the predicted answer (Hint: check above how we determine what the model predicts is the correct answer)
    predicted_label = ### YOUR CODE HERE ###
    # check if the prediction is correct
    is_correct = predicted_label == label
    correctness.append(is_correct)

# compute the accuracy
print("Accuracy: ", np.mean(correctness))

F1-score:

This is a score that is commonly used on binary tasks (i.e., tasks with only two possible answer options) instead of accuracy. It is calculated from the precision and recall of the test results. The precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly. The recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Here, positive and negative results refer to predictions in each of the two answer categories, respectively.

The F1 score is the harmonic mean of the precision and recall. It thus symmetrically represents both precision and recall in one metric:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

The more generic $F_{\beta}$ score applies additional weights, valuing one of precision or recall more than the other. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either precision or recall is zero.
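
As a minimal worked example with made-up counts (not results from any of the datasets below):

# F1 from made-up confusion-matrix counts
true_positive = 40    # predicted positive, actually positive
false_positive = 10   # predicted positive, actually negative
false_negative = 20   # predicted negative, actually positive

precision = true_positive / (true_positive + false_positive)   # 0.8
recall = true_positive / (true_positive + false_negative)      # ~0.667
f1 = 2 * precision * recall / (precision + recall)             # ~0.727
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")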

We will use the BoolQ dataset from the SuperGLUE benchmark and evaluate GPT-2’s performance in terms of F1 scores on it. This is a task wherein the model has to predict an answer (true/false) to a question, given context. Therefore, the positive prediction here will be “true”, and the negative “false”.

You can find the test dataset here. We will retrieve the model’s predictions similarly to the accuracy evaluation above. Specifically, we will retrieve the probabilities of “true” and “false”, given the context and the question.

Exercise 7.1.3: Calculating F1 scores

  1. Please complete the code below.

  2. Calculate the results. Does GPT-2 do well in this task?

  3. Evaluate the performance of the model using accuracy. What is the conceptual difference between the two results? Which one might be more reliable and why?

  4. Find out how to compute the F1 score with the sklearn.metrics package.

import pandas as pd
df_boolq = pd.read_csv("files/super_glue_boolq.csv")
# inspect the dataset to understand its structure
# if is_true = 1, it means that the answer to the question is "True"
df_boolq.head()
predicted_answer= []
true_answers = []

for i, r in df_boolq[:200].iterrows():
    # get the context for the question
    context = r['sentence1']
    # get the text of the question
    question = r['sentence2']
    # construct the list of answer options
    answer_options = ["False", "True"]
    # get the ground truth label
    true_answer = r["is_true"]
    
    # pass a list of contexts and a list of continuations to be scored
    try:
        answer_scores = lm_scorer.conditional_score(
            # format the context + question into a list of same length as the number of answer options
            [context + " " + question + "?"] * len(answer_options), 
            answer_options,
        ) 
    except:
        continue
    # get the predicted answer (Hint: check above how we determine what the model predicts is the correct answer)
    predicted_label = ### YOUR CODE HERE ###
    # record the predicted answer
    predicted_answer.append(predicted_label)
    true_answers.append(true_answer)
# compute the F1 score
true_positive = sum([(i == j) & (i == 1) for i, j in zip(predicted_answer, true_answers)])
print("True positive: ", true_positive)
false_positive = sum([(i != j) & (i == 1) for i, j in zip(predicted_answer, true_answers)]) 
print("False positive: ", false_positive)
false_negative = sum([(i != j) & (i == 0) for i, j in zip(predicted_answer, true_answers)])
f1_score = # YOUR CODE HERE
print("F1 score: ", f1_score)

NLG metrics: The lecture discussed the common metrics for generation evaluation: BLEU, ROUGE and METEOR. We already used ROUGE in task 2 of HW 3. These metrics all check whether the predicted text overlaps with ground truth texts. Often different overlap measures are used; for instance, overlaps of unigrams, bigrams or trigrams can be computed. These metrics originate from summarization and machine translation, where corpora of reference human summaries or translations are available. They are also applied to other generation tasks, as long as reference texts are available.
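
To build intuition for the quantity these metrics are built on, here is a toy sketch of (clipped) n-gram precision of a candidate sentence against a single reference; the helper ngram_precision is just an illustration, not one of the metric implementations used below.

from collections import Counter

def ngram_precision(candidate, reference, n):
    """Toy clipped n-gram precision of a whitespace-tokenized candidate
    against a single reference (the building block of BLEU-style metrics)."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

candidate = "the cat sat on the mat"
reference = "there is a cat on the mat"
print("Unigram precision: ", ngram_precision(candidate, reference, 1))  # 4/6
print("Bigram precision: ", ngram_precision(candidate, reference, 2))   # 2/5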

Below is space for trying out the BLEU score, in order to evaluate the translation predicted by FLAN-T5 small.

Exercise 7.1.4: Calculating NLG scores

  1. Please complete the code below by referring to the docs here.

  2. Calculate the results. What happens if you change the value of the max_n parameter, for this example and in general?

  3. If possible, try this out with a different language pair / a different sentence pair.

# import the implementation of the bleu score computation
from torchtext.data.metrics import bleu_score
# load model and tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer_t5 = T5Tokenizer.from_pretrained("google/flan-t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

# define example sentences for translating from English to German
text_en = "All of the others were of a different opinion."
text_de = "Alle anderen waren anderer Meinung."
# define task 
prefix = "Translate to German: "

# encode the source and the target sentences
encoding_en = tokenizer_t5(
    [prefix + text_en],
    return_tensors="pt",
).input_ids
# we don't need the task prefix before the target
encoding_de = tokenizer_t5(
    [text_de],
    return_tensors="pt",
).input_ids

# predict with model
predicted_de = model_t5.generate(encoding_en)



# decode the prediction
predicted_decoded_de = tokenizer_t5.decode(
    predicted_de[0],
    skip_special_tokens=True,
)
print("Predicted translation: ", predicted_decoded_de)

# compute BLEU for the prediction
### YOUR CODE CALLING THE HELPER ABOVE GOES HERE ###
bleu = 
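
For reference, one possible way to call the helper above (a sketch): torchtext’s bleu_score expects tokenized candidates and, for each candidate, a list of tokenized references; max_n and weights are lowered here because the sentences are very short.

# possible completion (sketch): tokenize candidate and reference,
# then compute BLEU up to bigrams
candidate_corpus = [predicted_decoded_de.split()]
references_corpus = [[text_de.split()]]
bleu = bleu_score(candidate_corpus, references_corpus, max_n=2, weights=[0.5, 0.5])
print("BLEU: ", bleu)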

Outlook#

  • The log probability based scoring methods are generally only well-defined for causal LMs. However, the work by Salazar et al. (2019) introduces a pseudo-log-likelihood scoring method for masked LMs (see the sketch after this list).

  • The lecture and the sheet have pointed out the diversity of available evaluation methods for LMs, which might raise the natural question of which method to choose and which one might work best. While this is an open research question, this great paper by Hu & Levy (2023) provides some insights regarding prompting vs. log probability based methods.

  • The lecture discussed the topic of calibration. There is a whole suite of work addressing calibration from a slightly more performance-oriented perspective: the predicted probability of the correct response in multiple choice tasks is compared to the accuracy of the LM on those tasks, which is often put in the context of the LM’s knowledge of and confidence about factual information. One influential paper is by [Kadavath et al. (2022)](https://arxiv.org/abs/2207.05221).
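
As a sketch of pseudo-log-likelihood scoring (mentioned in the first bullet above), assuming minicons provides a MaskedLMScorer with an API analogous to the IncrementalLMScorer used earlier:

# sketch of pseudo-log-likelihood (PLL) scoring with a masked LM via minicons
# (assumes the MaskedLMScorer class; please check the minicons docs for details)
from minicons import scorer

mlm_scorer = scorer.MaskedLMScorer("bert-base-uncased", device=device)
pll_scores = mlm_scorer.sequence_score([
    "The keys to the cabinet are on the table.",
    "The keys to the cabinet is on the table.",
])
# we would expect a higher (less negative) score for the grammatical sentence
print(pll_scores)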

Machine psychology#

As discussed in the lecture, there is another important perspective on evaluating LLMs that can be called machine psychology, which, in tandem with benchmark testing, can provide better and more robust evaluation results for LLMs.
This approach targets a better understanding of different (e.g., emergent) capabilities of LLMs and is often informed by methods from psychology, linguistics and cognitive science. There are several critical points that this perspective addresses:

  • The datasets and tests used here are often much more curated and, motivated by best practices of human research, cover diverse conditions related to the same phenomenon and better isolate that phenomenon (in contrast to more generic large benchmarks).

  • Studies in this domain may aim to evaluate to what extent LLMs’ I/O behavior is human-like. This may be relevant, e.g., in user-facing scenarios where the systems are employed.

  • Finally, studies in this domain might shed light onto long-standing theoretical debates. For instance, recent models have been taken to provide evidence regarding the learnability of grammar from data only (without innate biases). This opinion paper provides details on this debate.

Importantly, the LLM prediction retrieval methods for investigating machine psychology are often similar to or based on the benchmark evaluation methods. The difference often lies in the careful layout of the datasets, the hypotheses, and the overall methods for testing these hypotheses (e.g., supplemented with careful comparison to human data).

The sections below provide some examples of research questions within machine psychology that were mentioned in the lecture, and practical implementations for addressing them.

First, we will look at targeted syntactic evaluation of LLMs and address the question of whether GPT-2 is capable of distinguishing grammatical and ungrammatical sentences.

Exercise 7.1.5: Machine psychology

  1. Please complete the code below. (The docs here might help)

  2. Compute the results. How would you answer the research question above, based on these results?

  3. What are alternative scores which could be used to test this question? Might any of them be better than the implementation below? Why?

grammaticality_df = pd.read_csv("files/grammaticality_tests.csv")
grammaticality_df
# iterate over the pairs of sentences and compare the grammatical and ungrammatical sentences
grammaticality_predictions = []
for i, r in grammaticality_df.iterrows():
    # get the grammatical sentence
    grammatical_sentence = r["grammatical_sentence"]
    # get the ungrammatical sentence
    ungrammatical_sentence = r["ungrammatical_sentence"]
    # compute sentence log probabilities
    grammatical_log_prob = lm_scorer.sequence_score(
        ### YOUR CODE HERE ###
    )
    ungrammatical_log_prob = lm_scorer.sequence_score(
        ### YOUR CODE HERE ###
    )
    # compare the log probabilities
    is_grammatical = ### YOUR CODE HERE ###
    grammaticality_predictions.append(is_grammatical)
    
print("Accuracy: ", np.mean(grammaticality_predictions))

Now, we address a research question at the intersection of linguistic theory and methodological best practices. Specifically, following this paper, we want to understand whether:

  • LLMs can perform pragmatic language understanding tasks

  • whether they do so in a human-like way (in terms of matching human accuracy)

  • and whether different ways of retrieving LLM predictions lead to different fits to human data.

Specifically, we will focus on the interpretation of metaphors. The data from one LLM, namely GPT-3.5-turbo-instruct, and from humans, can be found here. The human data used in the paper and provided here is taken from the paper by Hu et al. (2022). An item in this dataset is a multiple choice task, and looks like this:

Context: Mary was asked about the town that she has just moved to. Mary responded: “This town is a chimney.” What does Mary mean? Answer options:

  • The town is not one of the cleanest one. (target nonliteral interpretation)

  • The people living in this town are very welcoming. (incorrect nonliteral interpretation)

  • All houses in this town have chimneys. (incorrect nonliteral interpretation)

  • The town is a chimney. (incorrect literal interpretation)

  • Mary found a job at a company installing chimneys. (incorrect distractor)

Exercise 7.1.6: Machine psychology 2

  1. Please look at the papers and complete the code below. What do the results tell us with respect to our research questions above?

metaphor_results_gpt = pd.read_csv("files/gpt_metaphor_results.csv")
metaphor_results_human = pd.read_csv("files/Human_Metaphor.csv")
metaphor_results_gpt
metaphor_results_human

Specifically, the GPT results were computed with different scoring methods; the method used for each row is recorded in the score column. First, we are interested in the question of which score resulted in the highest accuracy for GPT (whether the prediction for a given item is correct is recorded in the column target):

### YOUR CODE HERE ###
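
One possible sketch, assuming (as described above) that score names the scoring method and that target encodes whether the prediction was correct (1) or not (0):

# possible sketch: mean correctness (accuracy) per scoring method
accuracy_by_score = metaphor_results_gpt.groupby("score")["target"].mean()
print(accuracy_by_score.sort_values(ascending=False))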

Next, we are interested in comparing human and GPT results. The human results record whether a participant answered an item correctly in the column Correct. There are various ways of comparing the predictions. Following Hu et al. (2022), we could compute the correlations of the by-item accuracies of GPT and human data. The item IDs can be found in itemNum and item_id, respectively. One way to compute correlations in Python is documented e.g., here. Furthermore, we might want to investigate the correlation separately for the different LLM scoring methods.

#### YOUR CODE HERE ####
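
A possible sketch of this analysis (it assumes, as described above, that the GPT item IDs are in itemNum, the human ones in item_id, and that both files use the same item numbering):

# possible sketch of the by-item correlation analysis
from scipy.stats import pearsonr

# by-item human accuracy
human_by_item = metaphor_results_human.groupby("item_id")["Correct"].mean()

# by-item GPT accuracy, separately for each scoring method
for score_method, df_score in metaphor_results_gpt.groupby("score"):
    gpt_by_item = df_score.groupby("itemNum")["target"].mean()
    # align the two series on the shared item IDs before correlating
    aligned = pd.concat([gpt_by_item, human_by_item], axis=1, join="inner").dropna()
    r, p = pearsonr(aligned.iloc[:, 0], aligned.iloc[:, 1])
    print(f"Scoring method {score_method}: Pearson r = {r:.3f} (p = {p:.3f})")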