From serving models to agents

Nowadays, LLM serving frameworks like vLLM and SGLang are becoming increasingly popular. These frameworks provide a way to deploy LLMs as APIs in a scalable and cost-effective manner. However, they are designed for serving models, not agents.

What is an “Agent”?

To successfully land the capabilities of LLMs in the real world, we need to wrap the model in a compound system. This allows the LLM to interact with the world and to be tailored to solve domain-specific problems better. For example, a chatbot may help answer your questions, but it cannot help you win a lawsuit. An agent not only leverages the general intelligence of LLMs, but also uses domain knowledge to make decisions, tool use to take actions, and multi-agent collaboration to solve complex problems.

As LLMs continue to land in production, how to efficiently deploy and scale an agentic system becomes a critical question, and it can be genuinely challenging depending on the application scenario.

How to design an LLM Agent serving framework?

As Anthropic’s blog describes, there are basically two ways to use an LLM.

  • Agent: more flexible way to use LLM intelligence, but also more challenging to deploy and maintain. It allows the model to dynamically direct its own processes and tool usage, maintaining control over how it accomplishes tasks.
  • Agentic workflow: more constrained way to use LLM intelligence, but also easier to deploy and maintain. It allows the model to follow a predefined set of steps and tool usage, limiting the model’s control over how it accomplishes tasks.

For both ways, we need to consider the following aspects:

  • Generative Model: The LLM that the agent uses to generate content.
  • Retrieval-Augmented Generation (RAG): usually a vector database to store domain knowledge or user data, along with efficient retrieval algorithms.
  • Function calling (or tool use): A set of defined functions and tool APIs that the LLM can call to interact with the world. Anthropic introduced the Model Context Protocol (MCP) to standardize tool use.
  • Agent worker: Each worker is a tuned prompt or a script used to interact with the LLM. Based on the breakdown of the business logic, each agent worker is responsible for a specific task.

A general agentic serving framework should handle not only LLM serving, but also RAG data storage and retrieval, tool execution, and agent workflow management. It should be able to scale horizontally and provide a way to monitor system health and performance. We can call this “agentic serving”, which is different from traditional LLM serving.
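To make this concrete, here is a minimal sketch of how these four pieces might meet inside a single agent worker. All class and method names here (AgentWorker, retriever.search, llm.generate, tools.execute) are hypothetical placeholders, not the API of any existing framework:

```python
from dataclasses import dataclass


@dataclass
class AgentWorker:
    """One worker = one tuned prompt responsible for a specific subtask."""
    name: str
    system_prompt: str

    def step(self, task: str, llm, retriever, tools) -> str:
        # RAG: fetch domain knowledge relevant to this task (hypothetical retriever API).
        context = retriever.search(task, top_k=4)
        # Generation: the LLM proposes an answer and/or tool calls
        # (hypothetical llm.generate returning .text and .tool_calls).
        reply = llm.generate(system=self.system_prompt, task=task, context=context)
        # Tool use: execute requested tools, then let the model read the results.
        while reply.tool_calls:
            results = [tools.execute(c.name, c.arguments) for c in reply.tool_calls]
            reply = llm.generate(system=self.system_prompt, task=task,
                                 context=context, tool_results=results)
        return reply.text
```

An agentic serving framework is then the layer that schedules many such workers, routes their generation requests to the model backend, and manages the retrieval and tool infrastructure they share.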

Optimization for agentic serving

Let’s say we are serving an agentic workflow with multiple agent workers collaborating with each other. There are several optimizations we can apply to improve performance and reduce cost if we are also serving the model locally.

Context-aware Sampling

In a multi-agent system, agents often work on related subtasks within the same domain. For example, when testing a multi-agent coding task from SWE-Bench, we observed that tokens such as “def”, “class”, and “name” appear frequently in Python programs, while words like “button”, “title”, and “margin” are more common in building web UI tasks. Although each agent has its own task and context, the auto-regressive sampler can benefit from shared context among agents due to the high token overlap.

We can call this optimization context-aware sampling; it leverages N-gram speculative sampling to accelerate LLM inference. In this approach, at each sampling step the serving framework selects the top-N tokens from historical samples as draft tokens. The LLM then evaluates both the newly generated tokens and these draft tokens to produce a confidence score that determines whether a token is accepted. If a token is rejected, subsequent draft tokens are discarded and new tokens are sampled. Although N-gram sampling is not state-of-the-art, it avoids the complexity of incorporating an additional draft model. The key point is that, by sharing context among agents, the acceptance rate is significantly increased, offering a favorable trade-off between performance and sampling accuracy.
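Here is a minimal sketch of the idea: an N-gram table whose statistics are shared by all agents proposes draft tokens, and a hypothetical `model_verify` callable stands in for the serving engine’s verification step.

```python
from collections import defaultdict, Counter


class SharedNgramDrafter:
    """N-gram draft proposer whose statistics are shared across agents."""

    def __init__(self, n: int = 3, max_draft: int = 4):
        self.n = n
        self.max_draft = max_draft
        # (n-1)-token prefix -> counts of observed next tokens, shared by all agents.
        self.table: dict[tuple, Counter] = defaultdict(Counter)

    def observe(self, tokens: list[int]) -> None:
        """Update the shared statistics with tokens emitted by any agent."""
        for i in range(len(tokens) - self.n + 1):
            prefix = tuple(tokens[i:i + self.n - 1])
            self.table[prefix][tokens[i + self.n - 1]] += 1

    def propose(self, tokens: list[int]) -> list[int]:
        """Greedily extend the sequence with the most frequent continuations."""
        draft, ctx = [], list(tokens)
        for _ in range(self.max_draft):
            prefix = tuple(ctx[-(self.n - 1):])
            if prefix not in self.table:
                break
            nxt = self.table[prefix].most_common(1)[0][0]
            draft.append(nxt)
            ctx.append(nxt)
        return draft


def speculative_step(tokens, drafter, model_verify):
    draft = drafter.propose(tokens)
    # The target model checks the draft; rejected tokens and everything after
    # them are discarded (model_verify is a stand-in for the serving engine).
    n_accepted, next_token = model_verify(tokens, draft)
    accepted = draft[:n_accepted] + [next_token]
    drafter.observe(tokens[-drafter.n:] + accepted)
    return tokens + accepted
```

Because every agent feeds `observe` and reads from the same table, drafts proposed for one agent benefit from tokens recently generated by its peers working in the same domain.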

To prevent unbounded growth of the shared context and mitigate issues like catastrophic forgetting and hallucinations, we can control the input window size using a least-frequently used (LFU) cache. This cache maintains a token-level shared context by frequently updating it with new, relevant tokens while discarding less-used ones. In doing so, it not only prevents the context from growing indefinitely but also helps avoid local optima that could lead to repetitive or erroneous outputs.
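A simple way to bound the shared context is a token-level LFU structure like the sketch below; the capacity and eviction policy are illustrative choices, not prescribed values.

```python
class LFUTokenContext:
    """Token-level shared context bounded by a least-frequently-used policy.

    Keeps at most `capacity` distinct tokens; the rarest tokens are evicted
    first so the shared context cannot grow without bound.
    """

    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.freq: dict[int, int] = {}

    def add(self, token: int) -> None:
        self.freq[token] = self.freq.get(token, 0) + 1
        if len(self.freq) > self.capacity:
            # Evict the least frequently used token.
            victim = min(self.freq, key=self.freq.get)
            del self.freq[victim]

    def tokens(self) -> set[int]:
        return set(self.freq)
```

The drafter from the previous sketch could consult `tokens()` to keep only continuations that are still “hot” in the shared context, discarding stale n-grams along with the evicted tokens.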

In a single-agent configuration, N-gram speculative sampling remains a viable method; however, the limited reference context in this scenario may reduce sampling accuracy and yield only a minor boost in inference speed. Consequently, we can disable speculative decoding in such cases to preserve output quality.

Agent-level parallelism

In an agent system, agents may transition from one state to another, and the state transitions usually form a directed graph. It is straightforward to run agent workers simultaneously when they do not depend on each other. However, it is important to note that the graph may contain loops due to iterative processing. For example, an agent might be upstream of another in one iteration, but in the next iteration the dependency relationship can be reversed. Furthermore, if an agent receives its own output as the next round’s input, it must be processed sequentially rather than in parallel. This is the hard part of parallelism in multi-agent and multi-round systems: how to dynamically determine a safe combination of agents that can run in parallel at runtime, while treating self-dependent iterations as a single sequential step.
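One way to compute a safe per-iteration schedule is to recompute topological batches from the current round’s dependency edges, dropping self-edges because a self-dependent agent simply runs once per round. The sketch below (with made-up agent names in the usage example) illustrates the idea; it is one possible strategy, not the only one.

```python
from collections import defaultdict


def parallel_batches(agents, edges):
    """Group agents into batches that can safely run in parallel for ONE round.

    `edges` lists (upstream, downstream) dependencies valid for this round;
    because dependencies can flip between rounds, batches are recomputed each
    time. Self-edges (a, a) are skipped: an agent that feeds its own
    next-round input just runs once per round, sequentially across rounds.
    """
    indegree = {a: 0 for a in agents}
    downstream = defaultdict(list)
    for u, v in edges:
        if u == v:
            continue  # self-dependency: handled by running one round at a time
        downstream[u].append(v)
        indegree[v] += 1

    batches = []
    ready = [a for a in agents if indegree[a] == 0]
    while ready:
        batches.append(ready)  # everything in `ready` is mutually independent
        next_ready = []
        for a in ready:
            for b in downstream[a]:
                indegree[b] -= 1
                if indegree[b] == 0:
                    next_ready.append(b)
        ready = next_ready
    return batches


# Example (hypothetical agent names):
# parallel_batches(["planner", "coder", "tester"],
#                  [("planner", "coder"), ("planner", "tester"), ("coder", "coder")])
# -> [["planner"], ["coder", "tester"]]
```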

Analyzing the agent states and transitions can help identify the optimal parallelization strategy. But in a multi-agent system, the state space can be large and dynamic, so analyzing the state transition graph can take a long time. We can use another LLM agent to analyze the dependency graph and determine the optimal parallelization strategy in real time. For agentic workflows, the dependency graph can be analyzed offline and the optimal parallelization strategy determined in advance.

Disaggregated Tool Calling

Similar to prefill-decode (PD) disaggregation, the inference process in agents is more complex due to unpredictable resource allocation across different agent types and application scenarios. However, most of the computational overhead is attributed to the language model’s content generation. Therefore, we can simplify agent computations into two distinct categories: (1) internal generation workloads and (2) external tool usage. This separation allows us to optimize each component independently.

For the internal generation workload, numerous algorithms and best practices already exist to optimize inference speed. In contrast, external tool usage workloads vary by tool type; however, most non-AI functions have predictable execution times based on input size. To leverage this predictability, we can maintain a function runtime lookup table inside the agentic serving framework that stores the expected execution range for each call. Each time a function is executed, its total latency is computed and used to update the lookup table. Using these time ranges, we can schedule function calls more efficiently by allocating resources accordingly, which ultimately increases system throughput.
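A minimal sketch of such a lookup table follows, tracking (min, max, exponential moving average) per tool and per coarse input-size bucket; the bucketing and statistics are illustrative choices, and a real framework could track percentiles instead.

```python
import time


class ToolLatencyTable:
    """Per-tool expected execution range, updated after every call."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.stats = {}  # (tool, size_bucket) -> [min, max, ema] in seconds

    def _bucket(self, input_size: int) -> int:
        return input_size.bit_length()  # coarse log2-sized buckets

    def record(self, tool: str, input_size: int, latency: float) -> None:
        key = (tool, self._bucket(input_size))
        if key not in self.stats:
            self.stats[key] = [latency, latency, latency]
        else:
            lo, hi, ema = self.stats[key]
            self.stats[key] = [min(lo, latency), max(hi, latency),
                               (1 - self.alpha) * ema + self.alpha * latency]

    def expected(self, tool: str, input_size: int):
        """Return [min, max, ema] for this tool and input size, or None."""
        return self.stats.get((tool, self._bucket(input_size)))

    def timed_call(self, tool: str, fn, payload):
        """Run a tool call and feed its measured latency back into the table."""
        start = time.perf_counter()
        result = fn(payload)
        self.record(tool, len(str(payload)), time.perf_counter() - start)
        return result
```

The scheduler can then consult `expected()` before dispatching a call, e.g. packing several short tool calls onto one executor while reserving capacity for a call whose expected range is long.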

We can utilize an MCP client-server architecture to offload function calls and other non-generative workloads (e.g., RAG retrieval) to dedicated execution devices or clusters. This design ensures that LLM generation remains isolated from ancillary tasks, simplifying debugging and facilitating scalable system deployment.
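The sketch below shows the shape of that separation using asyncio; `ToolServerClient` and `llm_generate` are hypothetical stand-ins for an MCP client and the serving engine, not real SDK APIs.

```python
import asyncio


class ToolServerClient:
    """Stand-in for an MCP-style client that forwards tool calls to a
    dedicated execution server; `call_tool` is hypothetical and would be
    backed by the real MCP SDK or an RPC layer in practice."""

    async def call_tool(self, name: str, arguments: dict) -> dict:
        await asyncio.sleep(0.05)  # pretend network + tool execution time
        return {"tool": name, "result": f"executed with {arguments}"}


async def generate_with_offloaded_tools(prompt: str, llm_generate,
                                         client: ToolServerClient):
    """Keep generation on the LLM serving side; run tool calls elsewhere.

    `llm_generate` is a hypothetical async callable returning an object with
    .text and a list of requested tool calls.
    """
    reply = await llm_generate(prompt)
    while reply.tool_calls:
        # Tool calls (and RAG lookups) run concurrently on the tool server,
        # so GPU workers are never blocked on external I/O.
        results = await asyncio.gather(
            *(client.call_tool(c["name"], c["args"]) for c in reply.tool_calls)
        )
        reply = await llm_generate(prompt, tool_results=results)
    return reply.text
```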

Conclusion

These are just some of the optimizations I can think of (there may be mistakes!), and there are many other ways to optimize an agentic serving system. The key is to understand the system architecture and the application scenarios, and then optimize the system based on the specific requirements. Agentic serving systems are still in their early stages, and there are many challenges to overcome. However, with the rapid development of LLMs and the increasing demand for intelligent agents, I believe agentic serving systems will become more mature and widely used in the near future.

The need for such system designs is also increasing, and we look forward to more research and development in this area.