Table of Contents

  1. Starting with a Simple Agent
  2. Designing Compound AI Systems
  3. References

Starting with a Simple Agent

The simplest way to use LLMs is to leverage APIs provided by companies like OpenAI, Anthropic, or Hugging Face. If you’re unsure how to start building an LLM Agent, these APIs are a great starting point for small projects.

For example, let’s create a small Python program that translates natural language into corresponding emojis. We can use OpenAI’s API for this task. OpenAI provides the following prompt on its website:

Role Content
System You will be provided with text, and your task is to translate it into emojis. Do not use any regular text. Do your best with emojis only.
User Artificial intelligence is a technology with great promise.

We can call OpenAI’s API with the following code:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "system",
      "content": "You will be provided with text, and your task is to translate it into emojis. Do not use any regular text. Do your best with emojis only."
    },
    {
      "role": "user",
      "content": "Artificial intelligence is a technology with great promise."
    }
  ],
  temperature=0.8,
  max_tokens=64,
  top_p=1
)
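
The emoji translation can then be read from the first choice of the response:

# The generated emoji translation is in the first choice's message content
print(response.choices[0].message.content)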

To provide an interface for this application, we can write a simple command-line chat loop:

import questionary
from openai import OpenAI

client = OpenAI()
chat_history = [
    {
        "role": "system",
        "content": "You will be provided with text, and your task is to translate it into emojis. Do not use any regular text. Do your best with emojis only.",
    }
]

while True:
    user_prompt = questionary.text("You:").ask()
    chat_history.append({"role": "user", "content": user_prompt})
    response = client.chat.completions.create(
        model="gpt-4o", 
        messages=chat_history, 
        temperature=0.8, 
        max_tokens=64, 
        top_p=1
    )
    print("AI:", response.choices[0].message.content)

This completes a simple “Agent.” You will need an API Key to run this script from the command line:

export OPENAI_API_KEY="your-api-key"
python chat.py

Output:

? You: hello
AI: 👋😊
? You: a dog walking on the beach
AI: 🐕🚶‍♂️🏖️🌊

This is a simple example, but LLMs can handle tasks far beyond this. Depending on your business needs, you can experiment with different prompt designs.

Better Prompt Design

The prompt defines the task an LLM Agent needs to complete. Common LLM APIs include the following types of prompts:

  • System Prompt: Specifies the task requirements, such as “You will be provided with text, and your task is to translate it into emojis. Do not use any regular text. Do your best with emojis only.” This remains in the chat_history as the most critical instruction for the Agent.
  • User Prompt: Represents user input, such as “Artificial intelligence is a technology with great promise.” It is combined with the System Prompt to generate the next response.
  • Assistant: The Agent’s reply, such as “🐕🚶‍♂️🏖️🌊,” is added to chat_history as context for the LLM.
  • Tool: Some LLMs support function calls, and results from these functions are added to chat_history as context for the LLM.
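
Concretely, these message types accumulate in chat_history as a list of role-tagged entries. A minimal sketch (the tool entry follows OpenAI’s tool-message format, with an illustrative tool_call_id):

chat_history = [
    {"role": "system", "content": "You will be provided with text, and your task is to translate it into emojis. ..."},
    {"role": "user", "content": "a dog walking on the beach"},
    {"role": "assistant", "content": "🐕🚶‍♂️🏖️🌊"},
    # If the model calls a tool, the tool's result is appended as well, e.g.:
    # {"role": "tool", "tool_call_id": "call_abc123", "content": "..."},
]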

For developers, designing a good system prompt is crucial. A good prompt should explain the task requirements clearly and concisely, without being overly complex or verbose. Crafting a system prompt is akin to modeling the task itself: an effective system prompt encapsulates the process and logic needed to solve the task.

For instance, if designing an LLM Agent to solve math problems, the system prompt might be:

System: Given a math problem, provide the solution.

Alternatively, you can transform this into a programming problem:

System: Given a math problem, write a Python function to compute a reasonable solution.

Then use a Python function execution tool to compute the result. This approach turns a pure language reasoning problem into a programming problem, leveraging the LLM’s code generation capabilities. Such designs often yield surprising results.

Designing Compound AI Systems

Compound AI Systems (Matei Zaharia et al.) are defined as systems that address AI tasks using multiple interacting components, such as multiple model calls, retrievers, or external tools. Beyond prompt design, these systems are essential for tackling real-world, complex applications.

Taking the math-solving example further, a compound AI system could include an LLM Agent and a Python function execution tool. The workflow would be:

1. Define the task for the Agent (System Prompt)
2. User inputs a math problem (User Prompt)
3. LLM Agent generates a Python function based on the prompts
4. Python function execution tool runs the function to obtain results
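
A minimal sketch of steps 3 and 4, assuming the model is asked to return only runnable Python code that prints its answer (in practice the generated code should be sandboxed before execution):

import subprocess
from openai import OpenAI

client = OpenAI()

# Step 3: the LLM Agent generates Python code for the problem.
generated_code = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Given a math problem, write a Python function to compute a reasonable solution. Return only runnable Python code that prints the result."},
        {"role": "user", "content": "Solve the equation x^2 - 4 = 0"},
    ],
).choices[0].message.content

# Step 4: the execution tool runs the generated code and captures its output.
result = subprocess.run(["python", "-c", generated_code], capture_output=True, text=True)
print(result.stdout)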

This system involves component interactions beyond simple prompting. For more advanced performance, additional modules can be introduced. For instance, if some math problems are not suitable for code-based solutions, the LLM could first classify the problem type and then choose between reasoning or programming, each with a different system prompt.

This enhanced system includes three components:

  • Prompt Router: Selects the appropriate system prompt based on the problem type.
  • LLM Agent: Generates results based on the system prompt and User Prompt.
  • Python Function Execution Tool: Runs LLM-generated code (for programming problems).
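
The Prompt Router can itself be a small LLM call that classifies the problem before the main Agent runs. A minimal sketch, reusing the two system prompts from the math example above (the classification prompt and one-word labels are illustrative assumptions):

from openai import OpenAI

client = OpenAI()

REASONING_PROMPT = "Given a math problem, provide the solution."
CODING_PROMPT = "Given a math problem, write a Python function to compute a reasonable solution."

def route_prompt(problem):
    # First LLM call: classify whether the problem suits a code-based solution.
    label = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer with exactly one word, 'code' or 'reason': is the following math problem suitable for a code-based solution?"},
            {"role": "user", "content": problem},
        ],
        temperature=0,
    ).choices[0].message.content.strip().lower()
    # Select the system prompt for the main LLM Agent accordingly.
    return CODING_PROMPT if "code" in label else REASONING_PROMPT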

Function-calling

In the above example, a separate LLM call determines whether the input math problem is suitable for a code-based solution. This decision-making process can also be implemented through function calls if the Agent’s model supports them. OpenAI’s models from GPT-3.5-turbo onwards support function calls, as do open-source models like Llama-3-70B-Instruct-function-calling.

First, define a Python function to execute system commands:

import subprocess

def execute_command(command):
    try:
        result = subprocess.run(
            command, 
            shell=True, 
            check=True, 
            stdout=subprocess.PIPE, 
            stderr=subprocess.PIPE, 
            text=True
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.CalledProcessError as e:
        return e.stdout, e.stderr, e.returncode

Then define a schema to describe this function for the LLM:

# Code execution related function schema
schema_execute_command = {
    'name': 'execute_command',
    'description': 'Execute a command in the system shell. '
                   'Use this function when there is a need to run a system command, and execute programs.',
    'parameters': {
        'type': 'object',
        'properties': {
            'command': {
                'type': 'string',
                'description': 'The command to execute in the system shell'
            }
        },
        'required': ['command']
    }
}

You can now reference this function (tool) in the system prompt and pass its schema to the API:

response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'system', 'content': "Given a math problem, write a Python function to compute a reasonable solution. Then execute the function using execute_command."},
        {'role': 'user', 'content': "Solve the equation x^2 - 4 = 0"}
    ],
    functions=[schema_execute_command],
    function_call='auto'
)

By setting function_call to auto, the LLM decides when to call this function. For manual invocation, set function_call to the function name. This approach enables the agent to flexibly determine whether to use programming to solve a math problem, achieving a more adaptable design.
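
When the model decides to call the function, the response carries the function name and JSON-encoded arguments instead of plain text. A minimal sketch of handling it, using the legacy function_call fields that match the request above:

import json

message = response.choices[0].message
if message.function_call is not None:
    # Parse the arguments generated by the model and run the real function.
    args = json.loads(message.function_call.arguments)
    stdout, stderr, returncode = execute_command(args["command"])
    print(stdout)
else:
    # The model answered directly without calling the function.
    print(message.content)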

The benefit of function-calling lies in offloading specific tasks to dedicated functions, reducing the LLM’s workload and the risk of hallucination. However, challenges include identifying which tasks are suitable for function abstraction and controlling when functions are called.

RAG: Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) improves generation quality by grounding the model in external knowledge. A typical use case is a question-answering system that retrieves knowledge-base entries to produce accurate answers.

For LLM Agents, RAG is a commonly used method. A simple example is an agent for building user profiles for an e-commerce platform. If the system prompt is:

System: Create a detailed user profile.

The LLM may generate vague or unrealistic responses, which is what we call “hallucination.” Instead, if designed as:

System: Create a detailed user profile based on search history.
Search History:
- 2023-01-01: Searched for "iPhone 13"
- 2023-01-02: Searched for "MacBook Pro"
...

The resulting profile is more accurate and relevant to the user.

The LLM generates profiles grounded in external data, which is typically stored in and retrieved from a vector database. Examples of vector databases include:

  • ChromaDB: Lightweight, open-source, API-friendly, local.
  • Pinecone: Cloud-based, supports multiple retrieval algorithms.
  • Qdrant: Open-source, high-performance, scalable, implemented in Rust.
  • Milvus: Open-source, high-performance, cloud-native, distributed.
  • LanceDB: Open-source, multimodal data friendly.

Using external data, such as user search records, as reference material for the Agent improves personalization and accuracy for specific tasks.
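
A minimal sketch of this pattern with ChromaDB as the local vector store (the collection name, example records, and query text are illustrative):

import chromadb

# Store the user's search records in a local vector database.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="search_history")
collection.add(
    documents=[
        '2023-01-01: Searched for "iPhone 13"',
        '2023-01-02: Searched for "MacBook Pro"',
    ],
    ids=["rec-1", "rec-2"],
)

# Retrieve the records most relevant to the profiling task
# and ground the system prompt in them.
results = collection.query(query_texts=["user product preferences"], n_results=2)
retrieved = "\n".join("- " + doc for doc in results["documents"][0])
system_prompt = (
    "Create a detailed user profile based on search history.\n"
    "Search History:\n" + retrieved
)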

RAG can be combined with function-calling. For instance, suppose we have a function search_record to retrieve user search history. The system prompt can be:

System: Create a detailed user profile. Use the "search_record" function to retrieve search history as a reference.

The retrieved content is added to chat_history as LLM context:

System: Create a detailed user profile. Use the "search_record" function to retrieve search history as a reference.
Function call: search_record
- 2023-01-01: Searched for "iPhone 13"
- 2023-01-02: Searched for "MacBook Pro"
...
Assistant: The user prefers Apple products.

This decouples system prompts from external data, enhancing system flexibility. However, the effectiveness of this approach depends on the LLM’s in-context learning capabilities, which have limits.
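
Following the same schema format as execute_command, a sketch of how the hypothetical search_record function could be described to the LLM (the parameter names are illustrative):

schema_search_record = {
    'name': 'search_record',
    'description': 'Retrieve the user\'s recent search history. '
                   'Use this function when a user profile needs to be grounded in search records.',
    'parameters': {
        'type': 'object',
        'properties': {
            'user_id': {
                'type': 'string',
                'description': 'The id of the user whose search history should be retrieved'
            },
            'n_results': {
                'type': 'integer',
                'description': 'How many recent search records to return'
            }
        },
        'required': ['user_id']
    }
}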

Multi-agent Interaction

To extend the capabilities of a single LLM, consider designing systems with interactions between multiple agents. The key idea is to divide complex tasks into subtasks, assign each subtask to an agent, and integrate their results.

Multi-agent workflows generally fall into two categories:

  • Fixed Workflow: Agents follow a predefined sequence, each handling a specific module or task. Outputs are integrated into the final result.
  • Dynamic Workflow: Tasks are dynamically assigned to agents based on runtime conditions.

Fixed workflows suit tasks like writing a research paper, which involves steps such as problem formulation, literature review, writing, and proofreading—each handled by an agent.
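
A minimal sketch of such a fixed workflow, with one agent per step and each agent consuming the previous agent’s output (the step prompts are illustrative):

from openai import OpenAI

client = OpenAI()

def run_agent(system_prompt, user_content):
    # A single LLM call acting as one agent in the pipeline.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content

def write_paper(topic):
    steps = [
        "Formulate a precise research problem for the given topic.",
        "Write a short literature review for the given research problem.",
        "Draft a paper based on the given problem statement and literature review.",
        "Proofread and polish the given draft.",
    ]
    result = topic
    for system_prompt in steps:
        result = run_agent(system_prompt, result)
    return result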

Dynamic workflows are better suited to environments with no predefined order, such as virtual worlds where each NPC is an agent (Joon Sung Park et al.).

If a multi-agent process has clear start and end points, it can be represented as a Directed Acyclic Graph (DAG). Each node is an agent, and edges represent interactions. Tools like LangGraph can help implement such systems.

In contrast, systems without defined directions can be understood as finite-state machines, where states and transitions depend on environmental inputs. Designing such systems requires defining states, transitions, and conditions.

Handling Hallucination

Function-calling, RAG, and multi-agent interaction all aim to enhance generation quality, with hallucination being a critical evaluation metric.

Hallucinations are typically of two types:

  • In-context Hallucination: LLM output contradicts retrieved content (e.g., generating B when retrieval result is A).
  • Extrinsic Hallucination: Output contradicts real-world logic, often due to biases in training data.

Strategies to mitigate these include:

  • Using External Data: Incorporating external data via RAG to improve generation accuracy and address extrinsic hallucination.
  • Function-calling: Offloading tasks to external tools to reduce LLM workload and in-context hallucination.
  • Multi-model Integration: Combining and cross-validating outputs from multiple models (e.g., a large and a small model) for higher accuracy.
  • Filtering Results: Applying rules to filter illogical results.

However, excessive external data can mislead LLMs and increase in-context hallucination.

Different models vary in their adaptability to external data, but all have limits. When designing compound AI systems, balance is essential: avoid over-reliance on external data, and optimize it for specific scenarios.

Post-processing can also control output. For example, filtering out illogical results:

# Filter out results that fall outside a reasonable range
def filter_result(result, valid_range):
    low, high = valid_range
    return low <= result <= high

If filter_result returns False, the result is considered unreasonable and should be recomputed.

Safety and Evaluation

In production, evaluation ensures LLM outputs are reasonable. While simple rules suffice in basic scenarios (e.g., the filter example), production environments demand more extensive safety mechanisms, such as:

  • Fairness and bias regulation
  • Copyright protection
  • PII data protection

Fairness and bias involve two aspects:

  • Dataset Fairness and Bias: Business data inherently contains biases (e.g., demographics). Statistical methods, like Fairness Indicators, can assess fairness.
  • Model Fairness and Bias: Evaluating LLM fairness requires a small, curated dataset (Golden Dataset) with diverse features like gender, age, and region.

For copyright and PII protection, consider using a small classification model (e.g., Piiranha-v1) to detect sensitive information in LLM outputs. If detected, the output can be redacted or modified.
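
A minimal sketch of this check using the Hugging Face transformers token-classification pipeline; the model id below is an assumption based on the Piiranha-v1 release and should be verified before use:

from transformers import pipeline

# Assumed model id for Piiranha-v1; verify the exact id on the Hugging Face Hub.
pii_detector = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",
)

llm_output = "Contact John at john.doe@example.com for details."
entities = pii_detector(llm_output)

# Redact detected PII spans (iterate from the end so offsets stay valid).
redacted = llm_output
for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:entity["start"]] + "[REDACTED]" + redacted[entity["end"]:]
print(redacted)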

GenAI governance is complex, involving legal, ethical, and technical aspects. From the compound AI system perspective, a good prompt-tracking and safety-evaluation system is essential. Many agent-building platforms, such as Langfuse, LangSmith, and TruLens, provide such features.

References