Sheet 5.1 LLM agents#

Author: Polina Tsvilodub

This sheet takes a closer look at more complex LLM-based systems and LLM agents. Specifically, we will use the package langchain and its extensions to build our own LLM systems and explore their functionality. The learning goals for this sheet are:

  • understanding basics of langchain

  • trying out langchain agents and tools

  • understanding the basics of output processing

  • familiarization with basic handling of agent memory

Langchain is under heavy development. Sometimes examples provided in the docs break with version updates, so one needs to be somewhat patient.

NOTE: At this point, langchain provides quite vast functionality (and correspondingly vast docs) – of course, we do NOT expect you to study or understand all of that. The examples below provide links to some relevant parts of the documentation, and they serve as a little demo / inspiration of what is out there, as a starting point for you to learn more if you are interested.

LangChain#

The lecture discussed that modern LLMs can be viewed as building blocks of larger systems, be it for engineering or research purposes. In particular, one might want to use an LLM and make several calls to it (i.e., several inference passes), and combine the predicted results to complete one’s task. Note that when we talk about such systems, we (almost always) use the LLM for inference, i.e., the LLM is already pretrained / fine-tuned.

Using the terminology of langchain, a sequence of such LLM calls is called a chain. For each call, one minimally needs to specify a (pretrained / fine-tuned) LLM and a prompt that specifies what exactly the call should accomplish. For the prompt, oftentimes prompt templates are used. These prompt templates usually specify variables which are filled with inputs when the respective LLM call is invoked. The idea behind this is that the calls can be re-used, e.g., with various user inputs, without having to re-type the entire prompt. Further, these inputs may come from a previous LLM call. One neat feature of langchain is that it allows one to seamlessly chain LLM calls and stream outputs from one call into the next. Specific types of templates (e.g., chat prompt templates) also take care of formatting text in the way expected by the model, e.g., adding the required special tokens and format for chat models.
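For illustration, a minimal prompt template with a single variable could look like the following sketch (the translation prompt is just a made-up example; the dinner-menu system below uses the same mechanism):

# minimal sketch: a prompt template with one variable (made-up example prompt)
from langchain_core.prompts import PromptTemplate

translation_template = PromptTemplate(
    template="Translate the following sentence to German:\n{sentence}",
    input_variables=["sentence"],
)
# the variable is only filled in when the template is formatted / invoked
print(translation_template.format(sentence="I would like pasta for dinner."))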

[Disclaimer: Not sponsored by LangChain – there are other very useful tools for doing such things, for instance, Haystack. This is just one popular example.]

Below, we will first look at an example of a simple sequence of LLM calls. In particular, we will build a system that helps us to come up with a dinner menu, given some ingredients that we already have.

We will be using the NVIDIA API to get access to performant models. Instructions for retrieving an API key will be provided in class.

# please install the following packages and versions

#!pip install langchain-community langchain-openai langchain_nvidia_ai_endpoints==0.3.9 langchainhub duckduckgo-search wikipedia python-dotenv
import os
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.rate_limiters import InMemoryRateLimiter
# set some hyperparameters for the generation
temperature = 0.7
kwargs = {
    "max_tokens": 100,
}
# define which model to query
MODEL_NAME = "meta/llama-3.3-70b-instruct"

ingredients = "cauliflower, tomatoes."

instructions_text_appetizer = "I have the following ingredients in my fridge: \n{ingredients}\n\nWhich Italian appetizer can I make for dinner with these ingredients?"

instructions_text_main = "I am planning to make the following appetizer: \n{appetizer}\n\nWhich Italian main course can I make for my dinner?"

instructions_menu_summary = "I am planning the following recipes for my dinner: \nAppetizer: {appetizer}\nMain course: {main_course}\n\nPlease write a menu summary for my dinner."
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "")  # or paste your API token here as a string

# instantiate model
rate_limiter = InMemoryRateLimiter(
    requests_per_second=35 / 60,  # 35 requests per minute to be sure
    check_every_n_seconds=0.1,  # wake up every 100 ms to check whether allowed to make a request,
    max_bucket_size=7,  # controls the maximum burst size
)

llm = ChatNVIDIA(
    model=MODEL_NAME,
    api_key=NVIDIA_API_KEY,
    temperature=0,  # ensure reproducibility (overrides the temperature set above)
    rate_limiter=rate_limiter,  # bind the rate limiter
)

# construct prompts for our calls
prompt_template_appetizer = PromptTemplate(
    template = instructions_text_appetizer,
    input_variables = ['ingredients'],
)
prompt_template_main = PromptTemplate(
    template = instructions_text_main,
    input_variables = ['appetizer'],
)
prompt_template_summary = PromptTemplate(
    template = instructions_menu_summary,
    input_variables = ['appetizer', 'main_course'],
)
# construct sub-chains for each course
appetizer_chain = prompt_template_appetizer | llm | StrOutputParser()
main_chain = {"appetizer": appetizer_chain} | prompt_template_main | llm | StrOutputParser()
composed_chain = {"appetizer": appetizer_chain, "main_course": main_chain} | prompt_template_summary | llm | StrOutputParser()

# actually call the execution of the entire chain
composed_result = composed_chain.invoke({"ingredients": ingredients})
print("Result: ", composed_result)

Exercise 5.1.1: LLM chain

  1. Please look at the code above and try to understand what it does. Relevant docs about LLM calls can be found here, about chains here.

  2. Add a third course to our menu! Of course, you can also play around with the other prompts and the sequence.

Exercise 5.1.1.2#

# set some hyperparameters for the generation
temperature = 0.7
kwargs = {
    "max_tokens": 100,
}

ingredients = "cauliflower, tomatoes."

instructions_text_appetizer = "I have the following ingredients in my fridge: \n{ingredients}\n\nWhich Italian appetizer can I make for dinner with these ingredients?"

instructions_text_main = "I am planning to make the following appetizer: \n{appetizer}\n\nWhich Italian main course can I make for my dinner?"

instructions_text_dessert = "I am planning to make the following appetizer: \n{appetizer}\nand the following main course: \n{main_course}\n\nWhich dessert can I make for my dinner?"

instructions_menu_summary = "I am planning the following recipes for my dinner: \nAppetizer: {appetizer}\nMain course: {main_course}\nDessert: {dessert}\n\nPlease write a menu summary for my dinner."
# construct prompts for our calls
prompt_template_appetizer = PromptTemplate(
    template = instructions_text_appetizer,
    input_variables = ['ingredients'],
)
prompt_template_main = PromptTemplate(
    template = instructions_text_main,
    input_variables = ['appetizer'],
)
prompt_template_dessert = PromptTemplate(
    template = instructions_text_dessert,
    input_variables = ['appetizer', 'main_course'],
)

prompt_template_summary = PromptTemplate(
    template = instructions_menu_summary,
    input_variables = ['appetizer', 'main_course', 'dessert'],
)
# construct sub-chains for each course
appetizer_chain = prompt_template_appetizer | llm | StrOutputParser()
main_chain = {"appetizer": appetizer_chain} | prompt_template_main | llm | StrOutputParser()
dessert_chain = {"appetizer": appetizer_chain, "main_course": main_chain} | prompt_template_dessert | llm | StrOutputParser()
composed_chain = {"appetizer": appetizer_chain, "main_course": main_chain, "dessert": dessert_chain} | prompt_template_summary | llm | StrOutputParser()

# actually call the execution of the entire chain
composed_result = composed_chain.invoke({"ingredients": ingredients})
print("Result: ", composed_result)

Agents#

In the system above, we have decomposed the task of creating a dinner menu into “bite-sized” pieces for LLM calls ourselves; i.e., we have specified the order and the specific prompt for the single calls ourselves. Next, we will try to avoid these steps and use an agent instead: i.e., we will pass our overall task description to an LLM and let it figure out the necessary substeps on its own. Specifically, we will use a ReAct agent, which interleaves reasoning steps (“thoughts”) with tool calls (“actions”) and their results (“observations”) until it arrives at a final answer.

# same task with agent
from langchain import hub
from langchain.agents import AgentExecutor
from langchain.agents import create_react_agent

# Get an example prompt from the langchain hub that was constructed for this agent architecture. You can modify this!
prompt = hub.pull("hwchase17/react")
# inspect the prompt
prompt.template

Most modern out-of-the-box agents accept a list of tools at instantiation, which the agent can then decide to call during execution. Below, we add such tools to our agent – specifically, we provide it with a tool for calling the Wikipedia API and with a search API for real-time searches.

from langchain.agents import load_tools
from langchain_community.tools import DuckDuckGoSearchRun

search = DuckDuckGoSearchRun()
tools = load_tools(["wikipedia"], llm=llm)
tools.append(search)
print('tools',  tools)

# create an agent with tools
agent_with_tools = create_react_agent(llm, tools=tools, prompt=prompt)

# instantiate and call the agent
agent_executor = AgentExecutor(agent=agent_with_tools, tools=tools, verbose=True)
agent_executor.invoke({"input": "Please help me come up with a three course Italian dinner menu. It should be vegetarian. I have cauliflower and tomatoes in my fridge."})

Exercise 5.1.3: LLM agents with tools

  1. Please look at the code above and try to understand what it does. A list of various tools can be found here.

  2. What steps does the agent (try to) perform in order to accomplish the task? How does it “know” which steps to do when? When does it execute searches?

  3. Compare the results to the chain above. Do you observe differences?

  4. Is the Wikipedia tool a good choice for the task at hand? What else might we consider?

  1. The model searches stepwise. In the first step it searches for the overall information, getting more concrete from step to step. It even focuses on specific ingredients, like a human would do.

  2. The results the model found are vegetarian, use the given ingredients, and are Italian. Therefore, the results are better than the outputs of the chain before.

  3. Wikipedia is interesting for background information about Italian food and vegetarian food, but does not provide detailed recipes, so the possibilities are rather limited. Therefore, a tool concentrating on recipes might be helpful to get a better variety of recipes (see the sketch below for how such a custom tool could be added). Nonetheless, the results the agent found are better than the results of the plain chain.
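For instance, a custom tool can be defined with the @tool decorator and appended to the tool list; the recipe lookup below is only a hard-coded toy stand-in for a real recipe database or API:

# sketch: a (toy) custom recipe tool for the agent
from langchain_core.tools import tool

@tool
def recipe_lookup(dish: str) -> str:
    """Return a recipe for the given dish name."""
    # hard-coded stand-in for a real recipe database / API
    toy_recipes = {
        "bruschetta": "Toast bread, rub with garlic, top with chopped tomatoes, basil and olive oil.",
    }
    return toy_recipes.get(dish.strip().lower(), "No recipe found for this dish.")

# append the new tool and re-create the agent with the extended tool list
tools_with_recipes = tools + [recipe_lookup]
agent_with_recipes = create_react_agent(llm, tools=tools_with_recipes, prompt=prompt)
recipe_agent_executor = AgentExecutor(agent=agent_with_recipes, tools=tools_with_recipes, verbose=True)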

For the sake of seeing close to the top possible performance of agents, we have used a large hosted model via the NVIDIA API. But since hosted APIs come with rate limits or paywalls, we might also want to run an open-source model ourselves as the backbone for the agent. LangChain provides integrations with many different LLMs, including HuggingFace models that can be used via the HF API endpoint (which might not always be available and requires a HuggingFace account), or as a local model loaded via transformers, as we have learned. The latter requires downloading the model; since agent LLM systems require good performance of the backbone LLM, this should rather be tested with large models with at least a few billion parameters (mind their size for downloads!).
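As a sketch, a local HuggingFace model could be plugged in roughly as follows (this assumes the langchain-huggingface package is installed; the model ID is only an example and should be adjusted to your hardware):

# sketch: using a local HuggingFace model as the backbone instead of the hosted API
# (requires: pip install langchain-huggingface)
from langchain_huggingface import HuggingFacePipeline

local_llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",  # example model ID, adjust to your hardware
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)
# local_llm can now be used in place of the ChatNVIDIA instance in the chains and agents above;
# expect weaker agent behavior from smaller models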

Output parsing#

One of the core bottlenecks of chaining LLM calls is the potential necessity to process outputs in a specific (structured) way. This page provides an overview of how this can be approached. This step is key for enabling integration of LLMs into automatic systems where other components depend on the outputs of LLMs and usually expect particular input types or formats.
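As a small illustration of one common approach, an output parser can inject format instructions into the prompt and then parse the model’s reply into a Python object; below is a sketch re-using the llm defined above (note that the model is not guaranteed to actually follow the format, which is why this step can be brittle):

# sketch: parsing LLM output into a structured Python object
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate

class Dish(BaseModel):
    name: str = Field(description="name of the dish")
    ingredients: list[str] = Field(description="main ingredients of the dish")

parser = PydanticOutputParser(pydantic_object=Dish)

structured_prompt = PromptTemplate(
    template="Suggest one Italian appetizer using: {ingredients}\n{format_instructions}",
    input_variables=["ingredients"],
    # the parser provides instructions for the (JSON) format the model should produce
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

structured_chain = structured_prompt | llm | parser
dish = structured_chain.invoke({"ingredients": "cauliflower, tomatoes"})
print(dish.name, dish.ingredients)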

There are also packages / frameworks specialized in interfacing with LLMs and properly parsing their outputs like, e.g., LMQL and this controller framework from Microsoft.

These resources are intended as optional useful information, in case you will explore and build your own agents; you are not expected to have looked at them in detail.

Memory handling#

One of the main issues of agents is that they are stateless by default; i.e., across invocations there is no memory of what happened before. This is handled by adding memory components. An overview of this can be found here.
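As a minimal sketch of the simplest variant, we can wrap the model from above in a per-session chat message history (the session id is an arbitrary label):

# sketch: wrapping the model in a simple per-session message memory
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.messages import HumanMessage
from langchain_core.runnables.history import RunnableWithMessageHistory

session_store = {}  # maps a session id to its message history

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    # create a fresh history for unseen session ids
    if session_id not in session_store:
        session_store[session_id] = InMemoryChatMessageHistory()
    return session_store[session_id]

chat_with_memory = RunnableWithMessageHistory(llm, get_history)
config = {"configurable": {"session_id": "dinner-planning"}}

chat_with_memory.invoke([HumanMessage(content="I have cauliflower and tomatoes in my fridge.")], config=config)
reply = chat_with_memory.invoke([HumanMessage(content="What did I say is in my fridge?")], config=config)
print(reply.content)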

Exercise 5.1.4: Memory

  1. Take a look at the approach for handling message memory above. Recall the generative agent architecture that was discussed in the lecture. What is the difference between this simple approach and the memory implementation in the generative agents? What are respective (dis)advantages of either approach?