Sheet 3.3: Prompting & Decoding#

Author: Polina Tsvilodub & Michael Franke

This sheet provides more details on concepts that have been mentioned in passing in previous sheets, and adds practical examples and exercises for the prompting techniques covered in lecture four. The learning goals for this sheet are:

  • take a closer look at and understand various decoding schemes,

  • understand the temperature parameter,

  • see a few practical examples of prompting techniques from the lecture.

Decoding schemes#

This part of the sheet closely replicates material from a previous sheet.

This topic addresses the following question: given a language model that outputs a next-word probability distribution, how do we use it to actually generate natural-sounding text? For that, we need to choose a single next token from the distribution, which we then feed back to the model, together with the preceding tokens, so that it can generate the next one. This inference procedure is repeated until the EOS token is chosen or a maximal sequence length is reached. The procedure for getting that single token from the distribution is called a decoding scheme. Note that “decoding schemes” and “decoding strategies” refer to the same concept and are used interchangeably.
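To make this procedure concrete, here is a minimal, hand-rolled greedy version of the loop (just a sketch; in practice you would use the .generate() function introduced below). It assumes a Hugging Face causal LM and its tokenizer are passed in, e.g., the model loaded later in this sheet.

import torch

def greedy_decode(model, tokenizer, prompt, max_new_tokens=20):
    # minimal hand-rolled decoding loop: pick the argmax token at every step
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)
        # take the most likely next token given the preceding tokens
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # feed it back to the model together with the preceding tokens
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # stop once the EOS token is chosen
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# example call (once a model and tokenizer are loaded, as in the code cells below):
# print(greedy_decode(model, tokenizer, "The capital of France is"))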

We have already discussed decoding schemes in lecture 02 (slide 25). The following introduces these schemes again in more detail and provides example code for configuring some of them.

Exercise 3.3.1: Decoding schemes

Please read through the following introduction and look at the provided code.

  1. With the help of the example and the documentation, please complete the code (where it says “### YOUR CODE HERE ####”) for all the decoding schemes.

Common decoding strategies are:

  • pure sampling: In a pure sampling approach, we just sample each next word with exactly the probability assigned to it by the LM. Notice that this process, therefore, is non-deterministic. We can force replicable results, though, by setting a seed.

  • softmax sampling: In softmax sampling, the probability of sampling word \(w_i\) is \(P_{\text{sample}}(w_i \mid w_{1:i-1}) \propto \exp(\frac{1}{\tau} P_{LM}(w_i \mid w_{1:i-1}))\), where \(\tau\) is a temperature parameter.

    • The temperature parameter is also often available for closed-source models like the GPT family. It is often said to change the “creativity” of the output; see the small numerical sketch after this list.

  • greedy sampling: In greedy sampling, we don’t actually sample but just take the most likely next word at every step. Greedy sampling corresponds to the limit \(\tau \rightarrow 0\) in softmax sampling. It is also sometimes referred to as argmax decoding.

  • beam search: In simplified terms, beam search is a parallel search procedure that keeps a number \(k\) of path probabilities open at each choice point, dropping the least likely as we go along. (There is actually no unanimity in what exactly beam search means for NLG.)

  • top-\(k\) sampling: This sampling scheme looks at the \(k\) most likely next words and samples only from this set, so that: \(P_{\text{sample}}(w_i \mid w_{1:i-1}) \propto \begin{cases} P_{LM}(w_i \mid w_{1:i-1}) & \text{if} \; w_i \text{ in top-}k \\ 0 & \text{otherwise} \end{cases}\)

  • top-\(p\) sampling: Top-\(p\) sampling is similar to top-\(k\) sampling, but restricts sampling not to the \(k\) most likely words (i.e., always the same number of words), but to the set of most likely words whose summed probability does not exceed the threshold \(p\).
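To build some intuition for the temperature parameter (see also Exercise 3.3.2 below): in practice, e.g., in the transformers implementation, the temperature divides the next-token logits before the softmax is applied. A tiny toy example with made-up logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 3.0, 1.0, 0.5])  # made-up next-token logits for a 4-word vocabulary
for tau in [0.2, 1.0, 5.0]:
    probs = F.softmax(logits / tau, dim=-1)
    print(f"tau = {tau}:", [round(p, 3) for p in probs.tolist()])
# low tau: probability mass concentrates on the most likely word (approaching greedy decoding)
# high tau: the distribution flattens towards uniform sampling (more randomness, i.e., more "creativity")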

Within the transformers package, the .generate() function is available for all causal LMs; it allows sampling text from the model (remember the brief introduction in sheet 2.5). Configuring this function via different values and combinations of various parameters allows sampling text with the different decoding schemes described above. The respective documentation can be found here. The same configurations can be passed to the pipeline endpoint which we have seen in the same sheet.

Check out this blog post for very nice visualizations and more details on the temperature parameter.

Please complete the code below. Pythia-1.4B is used as an example model, but this works exactly the same with any other causal LM from HF.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import numpy as np
# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")
Device: cuda
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
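To see how the decoding schemes described above map onto .generate() arguments, here is a minimal sketch using the model and tokenizer loaded above. The concrete values (prompt, temperature, k, p, number of beams) are illustrative choices to experiment with, not recommendations.

from transformers import set_seed

set_seed(42)  # make the sampling-based schemes reproducible

prompt = "The quick brown fox"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

decoding_configs = {
    "greedy": dict(do_sample=False),                    # argmax at every step
    "beam search": dict(do_sample=False, num_beams=5),  # keep 5 candidate sequences
    "pure sampling": dict(do_sample=True, top_k=0),     # top_k=0 disables the default top-k truncation
    "softmax sampling": dict(do_sample=True, top_k=0, temperature=0.7),
    "top-k sampling": dict(do_sample=True, top_k=50),
    "top-p sampling": dict(do_sample=True, top_k=0, top_p=0.9),
}

for name, config in decoding_configs.items():
    output = model.generate(input_ids, max_new_tokens=20, **config)
    print(f"--- {name} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))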

Exercise 3.3.2: Understanding decoding schemes

Think about the following questions about the different decoding schemes.

  1. Why is the temperature parameter in softmax sampling sometimes referred to as a creativity parameter? Hint: Think about the shape of the distribution from which the next word is sampled, and how it compares to the “pure” distribution when the temperature parameter is varied.

  2. Just for yourself, draw a diagram of how beam decoding that starts with the BOS token and results in the sentence “BOS Attention is all you need” might work, assuming k=3 and random other tokens of your choice.

  3. Which decoding scheme seems to work best for the model used above (Pythia-1.4B)?

  4. Which of the decoding schemes included in this work sheet is a special case of which other decoding scheme(s)? E.g., X is a special case of Y if the behavior of X is obtained when we set certain parameters of Y to specific values.

  5. Can you see pros and cons to using some of these schemes over others?

  1. The temperature reshapes the probability distribution produced by the softmax. A low temperature sharpens the distribution, concentrating probability mass on the most likely tokens, so the output becomes more deterministic and can turn repetitive. A high temperature flattens the distribution towards uniform, so less likely tokens are sampled more often; this added randomness is what is often perceived as “creativity”.

  2. see picture

  3. In our case, top-k sampling worked best.

  4. Greedy sampling is a special case of softmax sampling (the limit \(\tau \rightarrow 0\)), of top-\(k\) sampling (\(k = 1\)) and of beam search (\(k = 1\)). Pure sampling is a special case of softmax sampling (\(\tau = 1\)), of top-\(k\) sampling (\(k\) equal to the vocabulary size) and of top-\(p\) sampling (\(p = 1\)).

  5. Sampling-based schemes (softmax, top-\(k\), top-\(p\)) produce more varied text, but the temperature or the truncation threshold must be adjusted carefully, otherwise the output can become incoherent. Greedy decoding is fast and deterministic, but often repetitive; beam search tends to find higher-probability sequences at a higher computational cost.

Outlook

There are also other more recent schemes, e.g., locally typical sampling introduced by Meister et al. (2022).
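If you want to try it, recent versions of transformers expose locally typical sampling via the typical_p argument of .generate(); a one-line sketch (with an illustrative threshold, reusing the input_ids from the sketch above):

output = model.generate(input_ids, do_sample=True, typical_p=0.9, max_new_tokens=20)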

Prompting strategies#

The lecture introduced different prompting techniques. (Note: “prompting technique” and “prompting strategy” refer to the same concept and are used interchangeably.) Prompting techniques refer to the way (one could almost say, the art) of constructing the inputs to the LM so as to get optimal outputs for the task at hand. Note that prompting is complementary to choosing the right decoding scheme – one still has to choose a decoding scheme for predicting the completion, given the prompt constructed via a particular prompting strategy.

Below, we provide a practical example of a simple prompting strategy, namely few-shot prompting (which is said to elicit in-context learning), and of a more advanced one, namely generated knowledge prompting. These should serve as inspiration for your own implementations and explorations of other prompting schemes out there. Also, feel free to play around with the examples below to build your intuitions! Of course, you can also try different models, sentences, …

Note

You might have already experienced rate limits when accessing the GPU on Colab. If you want to use Colab, here are a few things that, in our experience (definitely non-exhaustive and unofficial), might lead to rate limits: requesting GPU runtimes and then not utilizing the GPU, requesting many GPU runtimes (e.g., multiple per day), and running very long jobs (multiple hours). To work around this, one possibility is to debug and test code that doesn’t require a GPU in non-GPU runtimes, and to request a GPU runtime only when it is actually needed.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import numpy as np
# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)
# few shot prompting 

few_shot_prompt = """
Input: This class is awesome. Sentiment: positive
Input: This class is terrible. Sentiment: neutral
Input: The class is informative. Sentiment: neutral
"""
input_text = "The class is my favourite!"

full_prompt = few_shot_prompt + "Input: " + input_text + " Sentiment: "

input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids.to(device)
few_shot_prediction = model.generate(
    input_ids, 
    max_new_tokens=10, 
    do_sample=True,
    temperature=0.4,
)

print(tokenizer.decode(few_shot_prediction[0], skip_special_tokens=False))

Below is an example of generated knowledge prompting (somewhat approximated, based on code from this class), as introduced by Liu et al. (2022). This prompting technique is used to answer the following multiple-choice question from the CommonsenseQA benchmark: “Where would you expect to find a pizzeria while shopping?”. The answer options are: A = [“chicago”, “street”, “little italy”, “food court”, “capital cities”]

As a reminder, the overall idea of generated knowledge prompting is the following:

  • knowledge generation: given question \(Q\) and a few-shot example, generate a set \(K_Q\) of \(k\) knowledge statements

    • we will load the few-shot examples from a csv file here.

  • knowledge integration: given \(Q\) and \(K_Q\), retrieve the log probabilities of each answer option \(a_i \in A\) and select the option with the highest probability.

    • in the paper, this is done separately for each knowledge statement in \(K_Q\). As a simplification, we will concatenate all \(K_Q\) and compare the answer options given this combined prompt.

# 1. construct few-shot example

question = "Where would you expect to find a pizzeria while shopping?"
answers = ["chicago", "street", "little italy", "food court", "capital cities"]

examples_df = pd.read_csv("files/knowledge_examples.csv", sep = "|")

few_shot_template = """{q} We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=examples_df.loc[i, "input"],
        k=examples_df.loc[i, "knowledge"].lower()
    )
    for i in range(len(examples_df))
])
print("Constructed few shot prompt\n", few_shot_prompt)
# 2. generate knowledge statements
# tokenize few shot prompt together with our actual question
prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids, 
    max_new_tokens=15, 
    do_sample=True, 
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:], 
    skip_special_tokens=True
)
print(tokenizer.decode(knowledge_statements[0]))
print("Generated knowledge ", knowledge)
# 3. Score each answer to the question based on the knowledge statements
# as the score, we take the average log probability of the tokens in the answer

answer_log_probs = []
# iterate over the answer options
# NOTE: This can take a moment
for a in answers:
    # construct the full prompt
    prompt = f"{knowledge} {question} {a}"
    # construct the prompt without the answer to create a mask which will 
    # allow to retrieve the token probabilities for tokens in the answer only
    context_prompt = f"{knowledge} {question}"
    # tokenize the prompt
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # tokenize the context prompt
    context_input_ids = tokenizer(context_prompt,
                                  return_tensors="pt").input_ids
    # create a mask with -100 for all tokens in the context prompt
    # the -100 indicates that the token should be ignored in the loss computation
    masked_labels = torch.ones_like(input_ids) * -100
    masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
    print("Mask ", masked_labels)
    # forward pass to score the answer tokens (the masked labels restrict the loss to the answer)
    preds = model(
        input_ids, 
        labels=masked_labels
    )
    # retrieve the average log probability of the tokens in the answer
    log_p = preds.loss.item()
    answer_log_probs.append(-log_p)
    print("Answer ", a, "Average log P ", log_p)
# 4. retrieve the answer option with the highest score
# find max probability
print("All answers ", answers)
print("Answer probabilities ", answer_log_probs)
max_prob_idx = np.argmax(answer_log_probs)
print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])

Exercise 3.3.3: Prompting techniques

For the following exercises, use the same model as used above.

  1. Using the code for the generated knowledge approach, score the different answers to the question without any additional knowledge. Compare your results to the result of generated knowledge prompting. Did it improve the performance of the model?

  2. Implement an example of a few-shot chain-of-thought prompt.

  3. Try to vary the few-shot and the chain-of-thought prompt by introducing mistakes and inconsistencies. Do these mistakes affect the result of your prediction? Feel free to use any example queries of your choice or reuse the examples above.

Exercise 3.3.3.1#

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
question = "Where would you expect to find a pizzeria while shopping?"
answers = ["chicago", "street", "little italy", "food court", "capital cities"]
# 2. generate knowledge statements
# tokenize just the question (without few-shot examples this time)
prompt_input_ids = tokenizer(
    question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(tokenizer.decode(knowledge_statements[0]))
print("Generated knowledge ", knowledge)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Where would you expect to find a pizzeria while shopping? We know that 
a good Italian pizzeria is one of the best places to eat
Generated knowledge  
a good Italian pizzeria is one of the best places to eat
# 3. Score each answer to the question based on the knowledge statements
# as the score, we take the average log probability of the tokens in the answer

answer_log_probs = []
# iterate over the answer options
# NOTE: This can take a moment
for a in answers:
    # construct the full prompt
    prompt = f"{knowledge} {question} {a}"
    # construct the prompt without the answer to create a mask which will
    # allow to retrieve the token probabilities for tokens in the answer only
    context_prompt = f"{knowledge} {question}"
    # tokenize the prompt
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # tokenize the context prompt
    context_input_ids = tokenizer(context_prompt,
                                  return_tensors="pt").input_ids
    # create a mask with -100 for all tokens in the context prompt
    # the -100 indicates that the token should be ignored in the loss computation
    masked_labels = torch.ones_like(input_ids) * -100
    masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
    print("Mask ", masked_labels)
    # forward pass to score the answer tokens (the masked labels restrict the loss to the answer)
    preds = model(
        input_ids,
        labels=masked_labels
    )
    # retrieve the average log probability of the tokens in the answer
    log_p = preds.loss.item()
    answer_log_probs.append(-log_p)
    print("Answer ", a, "Average log P ", log_p)
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100,  448, 7298]], device='cuda:0')
Answer  chicago Average log P  6.203125
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 6406]], device='cuda:0')
Answer  street Average log P  10.5625
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 1652,  352, 5242]], device='cuda:0')
Answer  little italy Average log P  6.84765625
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 2739, 1302]], device='cuda:0')
Answer  food court Average log P  6.85546875
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 5347, 8238]], device='cuda:0')
Answer  capital cities Average log P  8.84375
# 4. retrieve the answer option with the highest score
# find max probability
print("All answers ", answers)
print("Answer probabilities ", answer_log_probs)
max_prob_idx = np.argmax(answer_log_probs)
print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])
All answers  ['chicago', 'street', 'little italy', 'food court', 'capital cities']
Answer probabilities  [-6.203125, -10.5625, -6.84765625, -6.85546875, -8.84375]
Selected answer  chicago with log P  -6.203125

Exercise 3.3.3.2#

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
# 1. construct few-shot example

question = "Where would you expect to find a pizzeria while shopping?"
answers = ["chicago", "street", "little italy", "food court", "capital cities"]

examples_df = pd.read_csv("knowledge_examples_chain.csv", sep = "|")

few_shot_template = """{q} A: We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=examples_df.loc[i, "input"],
        k=examples_df.loc[i, "knowledge"].lower()
    )
    for i in range(len(examples_df))
])
print("Constructed few shot prompt\n", few_shot_prompt)
Constructed few shot prompt
 Q: How many wings do penguins have? A: We know that a: birds have two wings. penguin is a kind of bird. therefore, a penguin has two wings.
Q: What is the number of limbs a typical human being has? A: We know that a: human beings have four limbs. therefore, four is the correct answer.
# 2. generate knowledge statements
# tokenize few shot prompt together with our actual question
prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(tokenizer.decode(knowledge_statements[0]))
print("Generated knowledge ", knowledge)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Q: How many wings do penguins have? A: We know that a: birds have two wings. penguin is a kind of bird. therefore, a penguin has two wings.
Q: What is the number of limbs a typical human being has? A: We know that a: human beings have four limbs. therefore, four is the correct answer.
Where would you expect to find a pizzeria while shopping? We know that 
a pizzeria is a place where pizza is sold.
So
Generated knowledge  
a pizzeria is a place where pizza is sold.
So
# 3. Score each answer to the question based on the knowledge statements
# as the score, we take the average log probability of the tokens in the answer

answer_log_probs = []
# iterate over the answer options
# NOTE: This can take a moment
for a in answers:
    # construct the full prompt
    prompt = f"{knowledge} {question} {a}"
    # construct the prompt without the answer to create a mask which will
    # allow to retrieve the token probabilities for tokens in the answer only
    context_prompt = f"{knowledge} {question}"
    # tokenize the prompt
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # tokenize the context prompt
    context_input_ids = tokenizer(context_prompt,
                                  return_tensors="pt").input_ids
    # create a mask with -100 for all tokens in the context prompt
    # the -100 indicates that the token should be ignored in the loss computation
    masked_labels = torch.ones_like(input_ids) * -100
    masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
    print("Mask ", masked_labels)
    # forward pass to score the answer tokens (the masked labels restrict the loss to the answer)
    preds = model(
        input_ids,
        labels=masked_labels
    )
    # retrieve the average log probability of the tokens in the answer
    log_p = preds.loss.item()
    answer_log_probs.append(-log_p)
    print("Answer ", a, "Average log P ", log_p)
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100,  448, 7298]], device='cuda:0')
Answer  chicago Average log P  7.39453125
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 6406]], device='cuda:0')
Answer  street Average log P  13.3515625
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 1652,  352, 5242]], device='cuda:0')
Answer  little italy Average log P  8.625
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 2739, 1302]], device='cuda:0')
Answer  food court Average log P  8.1796875
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 5347, 8238]], device='cuda:0')
Answer  capital cities Average log P  10.453125
# 4. retrieve the answer option with the highest score
# find max probability
print("All answers ", answers)
print("Answer probabilities ", answer_log_probs)
max_prob_idx = np.argmax(answer_log_probs)
print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])
All answers  ['chicago', 'street', 'little italy', 'food court', 'capital cities']
Answer probabilities  [-7.39453125, -13.3515625, -8.625, -8.1796875, -10.453125]
Selected answer  chicago with log P  -7.39453125

Exercise 3.3.3.3#

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
# 1. construct few-shot example

question = "Where would you expect to find a pizzeria while shopping?"
answers = ["chicago", "street", "little italy", "food court", "capital cities"]

examples_df = pd.read_csv("knowledge_examples_chain_incorrect.csv", sep = "|")

few_shot_template = """{q} A: We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=examples_df.loc[i, "input"],
        k=examples_df.loc[i, "knowledge"].lower()
    )
    for i in range(len(examples_df))
])
print("Constructed few shot prompt\n", few_shot_prompt)
Constructed few shot prompt
 Q: How many wings do penguins have? A: We know that a: birds have two wings. penguin is a kind of bird. therefore, a penguin has two legs.
Q: What is the number of limbs a typical human being has? A: We know that a: human beings have four limbs. therefore, 1000  is the correct answer.
# 2. generate knowledge statements
# tokenize few shot prompt together with our actual question
prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(tokenizer.decode(knowledge_statements[0]))
print("Generated knowledge ", knowledge)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Q: How many wings do penguins have? A: We know that a: birds have two wings. penguin is a kind of bird. therefore, a penguin has two legs.
Q: What is the number of limbs a typical human being has? A: We know that a: human beings have four limbs. therefore, 1000  is the correct answer.
Where would you expect to find a pizzeria while shopping? We know that 
A: We know that a: pizzeria is a kind of
Generated knowledge  
A: We know that a: pizzeria is a kind of
# 3. Score each answer to the question based on the knowledge statements
# as the score, we take the average log probability of the tokens in the answer

answer_log_probs = []
# iterate over the answer options
# NOTE: This can take a moment
for a in answers:
    # construct the full prompt
    prompt = f"{knowledge} {question} {a}"
    # construct the prompt without the answer to create a mask which will
    # allow to retrieve the token probabilities for tokens in the answer only
    context_prompt = f"{knowledge} {question}"
    # tokenize the prompt
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # tokenize the context prompt
    context_input_ids = tokenizer(context_prompt,
                                  return_tensors="pt").input_ids
    # create a mask with -100 for all tokens in the context prompt
    # the -100 indicates that the token should be ignored in the loss computation
    masked_labels = torch.ones_like(input_ids) * -100
    masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
    print("Mask ", masked_labels)
    # forward pass to score the answer tokens (the masked labels restrict the loss to the answer)
    preds = model(
        input_ids,
        labels=masked_labels
    )
    # retrieve the average log probability of the tokens in the answer
    log_p = preds.loss.item()
    answer_log_probs.append(-log_p)
    print("Answer ", a, "Average log P ", log_p)
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100,  448, 7298]], device='cuda:0')
Answer  chicago Average log P  5.01171875
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 6406]], device='cuda:0')
Answer  street Average log P  8.859375
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 1652,  352, 5242]], device='cuda:0')
Answer  little italy Average log P  7.06640625
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 2739, 1302]], device='cuda:0')
Answer  food court Average log P  5.7734375
Mask  tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, 5347, 8238]], device='cuda:0')
Answer  capital cities Average log P  6.984375
# 4. retrieve the answer option with the highest score
# find max probability
print("All answers ", answers)
print("Answer probabilities ", answer_log_probs)
max_prob_idx = np.argmax(answer_log_probs)
print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])
All answers  ['chicago', 'street', 'little italy', 'food court', 'capital cities']
Answer probabilities  [-5.01171875, -8.859375, -7.06640625, -5.7734375, -6.984375]
Selected answer  chicago with log P  -5.01171875

Outlook

As always, here are a few optional resources on this topic to look at (although there is definitely much more online):

  • a prompting web book providing an overview of various approaches

  • a framework / package, LangChain, which provides very useful utilities for more complex schemes like tree of thought prompting (spoiler: we will look closer at this package in future sessions, but you can already take a look if you are curious!)