
Using State Machines to Orchestrate Multi-Agent Systems

Part 4 of N
April 19, 2025

🚧 This article is still under construction. I'm actively working on it but wanted to share my progress. So forgive the poor writing and structure.
Check back later for updates! 🚧

Part 4: Improving agent reasoning

This is part four in a series on how to use a state machine framework to model agentic flows and how this approach enables some interesting features.

You can read the previous parts here: 1, 2, 3.

The outline of this part is as follows:

  1. Local versus global reasoning
  2. What is "good"?
  3. Building an evaluator
  4. Monte Carlo roll-outs with rejection sampling using branches
  5. Learning the agent configuration as a policy
  6. Evaluations

Local versus global reasoning

There are typically two types of reasoning that happen in an agentic flow: local and global.

Local reasoning refers to the steps an LLM might take to improve its own response to a specific input. This can happen entirely inside the LLM. Examples include self-reflection, CoT (Chain of Thought) reasoning, and the more advanced techniques used by modern reasoning models like o1, o3, and r1.

Global reasoning refers to reasoning flows that happen across multiple LLM interactions: in each step, the LLM directs the agent to take an action, often by executing a tool, and once done the result of that action is fed back into the LLM to guide the next action. In this case, we can view the reasoning flow as a sequence of blocks, where each block is the collection of states taking place between LLM calls.

Local and Global Reasoning

In the diagram above, local reasoning covers what happens inside each LLM call: the input to the local reasoning step is the output of the UserMessage, and the output of the LLM is captured by the AssistantMessage.

The global reasoning steps are the blocks in between - the stacks of states between LLM calls - where the input to each global reasoning step is the output of the previous LLM call and the output of the reasoning step is the input to the next LLM call.
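To make the distinction concrete, here is a minimal Python sketch of how such a trajectory could be represented. The names (Step, Block, Trajectory) are purely illustrative, not the actual types used in the framework from the earlier parts.

```python
from dataclasses import dataclass, field

# Illustrative data model only - not the framework's real types.

@dataclass
class Step:
    kind: str      # e.g. "UserMessage", "ToolCall", "ToolResult", "AssistantMessage"
    content: str

@dataclass
class Block:
    """One global reasoning step: all states between two LLM calls."""
    steps: list[Step] = field(default_factory=list)

@dataclass
class Trajectory:
    """A full reasoning flow: a sequence of blocks separated by LLM calls."""
    blocks: list[Block] = field(default_factory=list)
```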

In a previous part on steering agents, we discussed how to do self-reflection and other techniques related to local reasoning. In this part, we will look at how to improve the global reasoning capabilities of an agent when the LLM itself is a black box that can’t be retrained or improved.

What is "good"?

Before you can evaluate whether the trajectory of an agent or an individual reasoning step is good or bad, you need to:

  1. Precisely define what you are evaluating
  2. Define what good and bad look like with respect to what you are evaluating

In the setting of our state machine framework, we choose to evaluate the saved outputs of the agent (i.e. the artifacts). So to determine whether a reasoning flow is good or bad, we need to decide whether the artifacts emitted by that flow are good or bad.

By doing that, we largely ignore how the reasoning flow got to the output - we only look at the output itself. This ties back to something I mentioned in the previous part: reasoning flows are often locally unstable but globally stable - you often reach equivalent outputs, just through different paths.

Building an evaluator

What to learn

Since we treat the LLM as a black box, we are not able to train its internal policy on our labeled artifacts. Instead, we have to improve the reasoning trajectories of the agent by manipulating the flows that happen around and outside the black-box LLM. One way to do this is with an evaluator - i.e. a method or model that can judge whether a given trajectory is good or bad.

Given that our definition of good or bad relies on the artifacts, we have to further roll up our global reasoning steps into blocks of steps that each produced at least one artifact as an outcome. That way, we can view the trajectory of global reasoning steps as a trajectory of artifact generation. We can then build an evaluator that, given a trajectory, decides whether it is good or bad based on the artifacts it has produced.

Artifact Blocks
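As a rough sketch of this rolling-up, assuming each recorded step knows which artifacts (if any) it saved, the grouping could look something like this - the names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    description: str
    artifacts: list[str]  # ids of artifacts saved by this step (often empty)

def roll_up_into_artifact_blocks(steps: list[ReasoningStep]) -> list[list[ReasoningStep]]:
    """Group consecutive steps into blocks that each end with at least one artifact."""
    blocks, current = [], []
    for step in steps:
        current.append(step)
        if step.artifacts:        # close the block once an artifact has been produced
            blocks.append(current)
            current = []
    if current:                   # trailing steps that produced no artifact
        blocks.append(current)
    return blocks
```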

Heuristics-based evaluator

The easiest way to start is with a simple heuristic evaluator. This is, for example, what was used for DeepSeek's r1 reasoning model and, similarly, in DeepMind's AlphaProof.

However, in our case, that is a bit complicated since:

  1. The artifacts are in a form for which a rule-based evaluator would be nearly impossible to build
  2. Even with labeled examples of desirable outcomes, it is very difficult to tell if two outcomes are similarly good using a heuristic approach.

In our case, a heuristic evaluator would therefore need to be specific to the individual use case of the agents it evaluates. That makes total sense, but it isn't in the spirit of this post - so let's look at a more general approach instead!

LLM-as-a-Judge evaluator

One way to approximate a heuristic evaluator, without specifically training one, is to use a pre-trained LLM as the judge. In this case, we provide the LLM with a set of good and bad artifacts as examples and then prompt it to rate whether a new artifact is good or bad. This can work particularly well if each human label comes with a well-described reason, so that the subtleties of the human labeling are captured.
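A minimal sketch of such a judge, assuming a generic call_llm(prompt) helper standing in for whatever client you use; the prompt structure, with labeled examples and their reasons, is the interesting part:

```python
# Sketch of an LLM-as-a-Judge evaluator. `call_llm` is a stand-in for your own
# LLM client; the examples below are placeholders for human-labeled artifacts.

GOOD_EXAMPLES = [("...artifact text...", "concise and directly answers the task")]
BAD_EXAMPLES = [("...artifact text...", "references data that does not exist")]

def build_judge_prompt(artifact: str) -> str:
    lines = ["You are rating artifacts produced by an agent as GOOD or BAD.",
             "Here are labeled examples, each with the human's reasoning:"]
    for text, reason in GOOD_EXAMPLES:
        lines.append(f"GOOD: {text}\nReason: {reason}")
    for text, reason in BAD_EXAMPLES:
        lines.append(f"BAD: {text}\nReason: {reason}")
    lines.append("Rate the following artifact. Answer GOOD or BAD, then a short reason.")
    lines.append(artifact)
    return "\n\n".join(lines)

def judge(artifact: str, call_llm) -> bool:
    """Returns True if the judge labels the artifact GOOD."""
    return call_llm(build_judge_prompt(artifact)).strip().upper().startswith("GOOD")
```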

However, although this approach is a way around heuristics, it isn’t particularly good for a number of reasons:

  1. It can only keep a certain number of examples in context at one time
  2. It is quite error-prone when the difference between good and bad is subtle
  3. It is very compute-intensive and time-consuming at scale

So let’s move on to another type of evaluator!

Learned evaluator

The better way is to learn the evaluator - i.e. to train a separate model to be the evaluator of whether a reasoning trace should be considered good or bad.

To do this, we either need an oracle - i.e. an objective way to rate outcomes - or we need a set of human-labeled examples to learn from.

In most cases, the quality of an artifact - think images or code snippets - is difficult to determine programmatically. So, in general, we have to rely on human labeling of artifacts to produce training data for our evaluator.

Recall that in our setting, we assume that we do not have the ability to train the LLM itself but instead want to use the evaluator to direct the high-level agent flow towards a better outcome. We do this by letting the agent search through the space of reasoning traces, enabling it to focus on the most promising traces and to kill off the branches that are deemed low-quality.

Branches are ranked by using the trained evaluator to rate each artifact created within a branch and then averaging the ratings across the branch. This gives us a score for each branch, which we can use to reject and kill off the branches that score low.
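In code, that branch-level scoring might look roughly like this, where evaluate is the trained evaluator returning a score per artifact (the names and the threshold are illustrative):

```python
def score_branch(artifacts: list[str], evaluate) -> float:
    """Average the evaluator's scores over all artifacts produced in a branch."""
    if not artifacts:
        return 0.0
    return sum(evaluate(a) for a in artifacts) / len(artifacts)

def prune_branches(branches: dict[str, list[str]], evaluate, threshold: float = 0.5):
    """Keep only the branches whose average artifact score clears the threshold."""
    scores = {name: score_branch(arts, evaluate) for name, arts in branches.items()}
    return {name: branches[name] for name, s in scores.items() if s >= threshold}
```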

RLHF and comparative ranking

RLHF is probably the most popular way of learning evaluators from human feedback.

In our case, where we ultimately want to attach a score to each artifact produced, a good way to approach the human labeling is comparative ranking rather than absolute scoring. As many studies have shown, humans are much better and far more consistent at comparative ranking than at scoring - especially when rating complex items, like student homework or the similarly complex insights generated by agents.

Once the data is labeled, we can train the evaluator by fine-tuning a relatively simple LLM as the base model on the task of comparing good versus bad artifacts. The training input is then a pair of artifacts, each with a summary of its task and context, and the objective is to pick the better of the two, with the human labeling as the ground truth.

This task transfers over well to the more general task of rating the quality of an individual artifact.
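As a sketch of what that pairwise training objective looks like, here is a Bradley-Terry-style preference loss in PyTorch. For brevity, the scorer is a small MLP over precomputed embeddings of (task summary + artifact); in practice it would sit on top of the fine-tuned base LLM.

```python
import torch
import torch.nn as nn

class ArtifactScorer(nn.Module):
    """Toy scorer over precomputed artifact embeddings (stand-in for a fine-tuned LLM head)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)   # one scalar score per artifact

def pairwise_loss(scorer, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(s_preferred - s_rejected): push the human-chosen artifact above the other."""
    return -torch.nn.functional.logsigmoid(scorer(preferred) - scorer(rejected)).mean()

# Usage sketch with random embeddings standing in for real labeled pairs:
scorer = ArtifactScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)
preferred, rejected = torch.randn(32, 768), torch.randn(32, 768)
loss = pairwise_loss(scorer, preferred, rejected)
opt.zero_grad(); loss.backward(); opt.step()
```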

Monte Carlo roll-outs with rejection sampling using branches

Now that we have a way to rate any interaction stack based on the artifacts saved during its reasoning, we can use it to judge whether a reasoning trace looks "good" or not.

Combining this with a Monte Carlo approach for each reasoning step - i.e. allowing the agent to propose multiple next steps - lets you generate a number of potential steps and then pick only the one, or ones, that look the most promising. Practically speaking, this can be done in a number of ways; the easiest with most LLMs is simply to raise the temperature and sample a set of candidate next steps for the Monte Carlo path.
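A sketch of that candidate-sampling step, with call_llm and evaluate_step as placeholders for your own client and evaluator:

```python
def propose_next_steps(prompt: str, call_llm, k: int = 4, temperature: float = 0.9) -> list[str]:
    """Sample several candidate next steps by re-calling the LLM at a higher temperature."""
    return [call_llm(prompt, temperature=temperature) for _ in range(k)]

def pick_best_step(prompt: str, call_llm, evaluate_step) -> str:
    """Keep only the candidate the evaluator rates highest."""
    return max(propose_next_steps(prompt, call_llm), key=evaluate_step)
```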

You can extend this further by combining it with tree search to let the agent continue along several promising paths at the same time - what is typically referred to as roll-outs, i.e. we "roll out" a potential path for several steps to see if it is a good path to go down.

Monte Carlo Tree Search

Since our trained evaluator works on artifacts, we can easily attach a score to each path by averaging (or summing, taking the max or min, etc.) the scores across all artifacts produced by a given path, and then discarding the lowest-scoring trajectories as we go along. This logic can be made more sophisticated by comparing paths based on distributions of artifact scores instead of a single metric, if one enjoys complexity.
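Putting the pieces together, a beam-style roll-out loop could look roughly like this - extend_path and evaluate are placeholders, and each path object is assumed to carry the artifacts it has produced so far:

```python
from statistics import mean

def rollout(initial_paths, extend_path, evaluate, depth: int = 3, beam_width: int = 2):
    """Expand each surviving path, score it by mean artifact score, keep only the top paths."""
    paths = list(initial_paths)
    for _ in range(depth):
        expanded = [new for p in paths for new in extend_path(p)]
        scored = sorted(
            expanded,
            key=lambda p: mean(evaluate(a) for a in p.artifacts) if p.artifacts else 0.0,
            reverse=True,
        )
        paths = scored[:beam_width]   # discard the lowest-scoring trajectories
    return paths
```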

Learning the agent configuration as a policy

Above I said that since the LLM is a black box, we can't change the behavior of the agent but instead have to work around it with an evaluator. This isn't quite true - we can actually use the ability to dynamically change the configuration of the agent (as described in part 3) to change its behavior, and this can be done by a learned model - somewhat similar to on-policy fine-tuning of the LLM.

Now instead of an RL-trained evaluator, we have an RL-trained agent-picker - i.e. a model that selects the tools and configuration of the agent that is then going to perform the next step. This way we avoid the computational complexity of MCTS and can instead rely on a single agent trajectory. This does, however, remove some of the creativity that comes with MCTS, where your population of potential paths taken and artifacts generated is much larger than when you run a single directed path.

Of course, the situation with DeepSeek's r1 is quite different, since they retrained the entire underlying LLM based on their reasoning setup - but the flow they used is in many ways very similar to what can be used to learn a policy for selecting the next agent configuration.

R1 Reasoning

In their case, they used GRPO (Group Relative Policy Optimization), a REINFORCE-style on-policy RL algorithm, to learn the policy.

This works really well for us, since the beauty of REINFORCE-style methods is that they do not need a value function. That means they don't require the ability to predict future rewards - instead, they play out entire trajectories of the policy, Monte-Carlo style, and then use the rewards collected from those trajectories to update the policy network.
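To make that concrete, here is a minimal REINFORCE-style sketch in PyTorch for learning an agent-picker: a small categorical policy chooses the next agent configuration from a state embedding, a whole trajectory is played out, and the collected rewards (e.g. evaluator scores on the artifacts produced) weight the log-probabilities. The environment step, state embeddings and the set of configurations are all assumptions for illustration.

```python
import torch
import torch.nn as nn

N_CONFIGS, STATE_DIM = 8, 128   # hypothetical number of agent configurations / embedding size

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_CONFIGS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env_step, max_steps: int = 10):
    """Play out one full trajectory, Monte-Carlo style; no value function needed."""
    log_probs, rewards, state = [], [], torch.randn(STATE_DIM)   # stand-in initial state
    for _ in range(max_steps):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()                          # which agent configuration to use next
        log_probs.append(dist.log_prob(action))
        state, reward, done = env_step(action.item())   # e.g. evaluator score on new artifacts
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards

def reinforce_update(log_probs, rewards, gamma: float = 0.99):
    """Weight each step's log-prob by the discounted return collected after it."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```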

Evaluations

In the context of agents and LLMs, the term evaluations typically refers to a set of tests or procedures that let you measure the performance of a given LLM or agent as you change the configuration of either. Because of the stochastic nature and unstructured inputs and outputs of agents and LLMs, these are particularly difficult to construct and maintain.

However, another good thing about having a well-learned evaluator is that we can use it for evaluations - i.e. use the evaluator to test whether the quality of the outputs on a fixed set of test use cases improves with LLM or configuration changes. We can then use this setup over time to monitor output quality and whether the performance of the agents is generally improving or deteriorating.
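A small sketch of such an evaluation harness, with run_agent and evaluate as placeholders for your own agent runner and trained evaluator:

```python
from statistics import mean

def run_eval_suite(test_cases, run_agent, evaluate, config) -> float:
    """Average evaluator score across a fixed set of test use cases for one configuration."""
    scores = []
    for case in test_cases:
        artifacts = run_agent(case, config)
        scores.append(mean(evaluate(a) for a in artifacts) if artifacts else 0.0)
    return mean(scores)

# Track this number per configuration / LLM version over time to see whether
# the agents are generally improving or deteriorating.
```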

Much more could be said generally about evaluations and how to construct them for complex multi-agent setups, but that will have to wait for another time!