AI Product *Teams* are Distributed Reinforcement Learning Systems
On Socio-Technical Systems #
In one of my favourite classic cognitive science papers, Edwin Hutchins (1995) wrote How a Cockpit Remembers Its Speeds (pdf). Cognitive scientists were used to thinking of people as agents — an agent being an intelligent system that has goals, perceives its environment, represents knowledge about the environment in some internal state, and performs actions to affect the environment. The big idea of this paper is that we can think of broader systems, for instance the pilots in an airplane cockpit and the equipment they use, in the same terms, and analyse the properties — the goals, environment, representations, and actions — of these systems.
He probably puts it better:
Cognitive science normally takes the individual agent as its unit of analysis. In many human endeavors, however, the outcomes of interest are not determined entirely by the information processing properties of individuals. Nor can they be inferred from the properties of the individual agents, alone, no matter how detailed the knowledge of the properties of those individuals may be. In commercial aviation, for example, the successful completion of a flight is produced by a system that typically includes two or more pilots interacting with each other and with a suite of technological devices. This article presents a theoretical framework that takes a distributed, socio-technical system rather than an individual mind as its primary unit of analysis. This framework is explicitly cognitive in that it is concerned with how information is represented and how representations are transformed and propagated in the performance of tasks. An analysis of a memory task in the cockpit of a commercial airliner shows how the cognitive properties of such distributed systems can differ radically from the cognitive properties of the individuals who inhabit them.
Here’s my thesis. While most AI product development teams working today won’t be training or even fine-tuning the foundation models they work with, the AI-powered products they build, plus the product team itself, form a system which, as a whole, constitutes a distributed reinforcement learning “agent” in the traditional sense.
That might be a fairly hot take, so let me work through it.
What’s an “agent”? #
While the term “agent” has unfortunately lost all meaning in the LLM community, it has a pretty clear-cut meaning in more traditional AI and cognitive science. Agents:
- Have goals they try to achieve.
- Exist in an environment with which they interact.
- They have sensors which provide them with information from the environment. They perceive. They have inputs.
- They have actuators which allow them to affect the environment. They act. They have outputs.
- It’s often useful, in cognitive science, to also think of agents as having internal representations: memory for information about the environment and their goals.
A crucial part of any agent is their policy, the rules they follow to decide on their actions, based on what they perceive and remember (collectively called their state). A policy maps states to actions. We usually talk about the value of actions and the policies that produce them, where something is valuable if it helps an agent achieve whatever its goals are.
Some agents, like old-fashioned chess computer programs, have their policies coded by hand. Others, like plants, humans, and the Netflix recommendation algorithm, adjust their policies over time by learning from experience. Doing this effectively usually requires dealing with what’s called the explore-exploit trade-off: finding a balance between actions that move them immediately towards their goals, and alternative actions that might be less immediately valuable but provide useful information to learn from. Everyone’s favourite example of this is the problem of choosing a restaurant for lunch, but I’m not going to repeat it here.
Reinforcement learning is the field of computer science that deals with agents that update their policies with experience and navigate this explore-exploit trade-off. So, tying all this together, a reinforcement learning agent is a system that has goals, interacts with its environment, and improves its behaviour by learning from the reward signals that environment provides.
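Since the lunch example is on the table anyway, here’s a minimal sketch of such an agent: an epsilon-greedy bandit repeatedly choosing a restaurant. The restaurant names and reward values are invented for illustration; the point is just that the policy maps the agent’s internal state (its value estimates) to actions, and occasionally explores.

```python
import random

# A minimal sketch of a reinforcement learning agent: an epsilon-greedy
# bandit choosing a restaurant for lunch. Names and reward values are invented.
restaurants = {"usual_place": 0.8, "new_place": 0.6, "other_new_place": 0.9}

estimates = {name: 0.0 for name in restaurants}  # the agent's internal state
counts = {name: 0 for name in restaurants}
epsilon = 0.1  # how often to explore instead of exploiting

for _ in range(1000):
    # Policy: map the current state (value estimates) to an action.
    if random.random() < epsilon:
        choice = random.choice(list(restaurants))   # explore
    else:
        choice = max(estimates, key=estimates.get)  # exploit
    # The environment returns a noisy reward signal.
    reward = restaurants[choice] + random.gauss(0, 0.1)
    # Update the policy (here, a running average of observed rewards).
    counts[choice] += 1
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print(max(estimates, key=estimates.get))  # should usually be "other_new_place"
```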
Reinforcement Learning from Human Feedback #
There’s already a well-known application of reinforcement learning to improving foundation models and the products they drive, like ChatGPT. In the classical sense above, generative AI models are already agents, just not very interesting ones. Let’s focus on the GPT class of language model. The model’s “goal” is to output the token most likely to appear after the sequence of tokens already provided in the “prompt”, given the training data. Its inputs are the prompt tokens. Its output is the predicted next token.
A more interesting agent emerges when we use the language model in a loop: each output token is appended to the prompt, and the whole thing is fed back to the model as a new input, until generation stops. This is the system most people will be familiar with, and it can also be analysed as an agent: the input is the original user-provided prompt, and the output is now a whole sequence of new tokens. What’s the goal of this larger system? With only pretrained foundation models, we can think of the goal as still being to output tokens that are probable given the prompt and the training data.
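That loop looks roughly like the sketch below. `predict_next_token` is a stand-in for whatever next-token interface the model exposes, not any particular provider’s API.

```python
# Sketch of the autoregressive loop described above. `predict_next_token`
# stands in for whatever interface the model exposes; it is not a real API.
def generate(predict_next_token, prompt_tokens, max_tokens=256, eos_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next_token(tokens)  # the "agent" acts: one token out
        if next_token == eos_token:
            break
        tokens.append(next_token)                # the action becomes part of the next input
    return tokens[len(prompt_tokens):]           # the larger system's output: a whole sequence
```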
Reinforcement Learning from Human Feedback is the innovation that made ChatGPT possible. There’s a lot written about how this works, so I won’t repeat it here, except to say that the model is made to generate multiple alternative output sequences, a reviewer (originally a human, now more typically another AI) identifies the best or most valuable one, according to their goals, and the model’s weights are updated (its policy is updated) to make that output more likely in future. What constitutes the “agent” here? I think the most useful way to see it is that the language model itself is still the agent, while the reviewers form the environment with which the agent interacts and from which the value signals come.
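Schematically, the RLHF loop looks something like this. The model, reviewer, and update step below are all placeholder stubs: real pipelines train a separate reward model and update the weights with algorithms like PPO, rather than anything this direct.

```python
import random

# A very schematic sketch of the RLHF loop described above. The model,
# reviewer, and update step are all stand-ins, not a real pipeline.

def generate(prompt: str) -> str:
    """Stand-in for sampling a completion from the language model."""
    return prompt + " " + random.choice(["completion A", "completion B", "completion C"])

def reviewer_score(prompt: str, completion: str) -> float:
    """Stand-in for a human or AI reviewer judging how valuable an output is."""
    return random.random()

def update_policy(prompt: str, preferred: str) -> None:
    """Stand-in for the weight update that makes `preferred` more likely."""
    pass

def rlhf_step(prompt: str, n_candidates: int = 4) -> str:
    candidates = [generate(prompt) for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: reviewer_score(prompt, c))  # value signal from the environment
    update_policy(prompt, best)  # the agent's policy is updated
    return best

print(rlhf_step("Bonjour, comment ça va ?"))
```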
This BlueSky thread is a fantastic elaboration on the idea I present here.
Here's why "alignment research" when it comes to LLMs is a big mess, as I see it.
Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".
— Colin (@colin-fraser.net) 19 December 2024 at 23:15
How do AI Product teams work? #
Now, let’s outline the typical approach a product development team takes to building a foundation model-powered product in 2025. To be concrete, I’ll use a RAG-enabled chatbot as an example, but I think these ideas apply to most products where model-generated outputs have a significant role in the user experience. To be even more concrete, let’s say the product is something to help people practise their French.
First, the team builds an MVP of the product. Let’s say this consists of a UI where the user types messages and reads responses, and a backend that takes the user input, uses a retrieval pipeline to pull relevant contextual information from a database, and then combines the user’s message, the retrieved text, the conversation history, and a “system” prompt providing instructions for how the language model should respond, and sends all of this information off in a request to the language model provider. The provider responds with a reply generated by the model, which is forwarded on to the user, who replies, and the whole loop goes again.
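A sketch of that backend loop is below. The `retrieve` and `call_language_model` functions are placeholders standing in for a real retrieval pipeline and a real provider client, and the system prompt is invented for illustration.

```python
# Sketch of the chatbot backend described above. `retrieve` and
# `call_language_model` are placeholders, not a specific provider's API.

SYSTEM_PROMPT = "You are a friendly French tutor. Reply in simple French and gently correct mistakes."

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for the retrieval pipeline (vector search, BM25, etc.)."""
    return ["(retrieved grammar notes would go here)"]

def call_language_model(messages: list[dict]) -> str:
    """Stand-in for the request to the language model provider."""
    return "(model-generated reply)"

def handle_user_message(user_message: str, history: list[dict]) -> str:
    context = "\n".join(retrieve(user_message))
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT + "\n\nContext:\n" + context}]
        + history
        + [{"role": "user", "content": user_message}]
    )
    reply = call_language_model(messages)
    # The reply goes back to the user, and both turns join the conversation
    # history, ready for the next pass around the loop.
    history.extend([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ])
    return reply

print(handle_user_message("Bonjour !", []))
```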
This product is clearly another agent, in the sense above. The goal is for the user to have a positive experience and improve their language skills. The inputs and outputs are still the user’s messages and the chatbot’s responses, although the internal representations have become more complicated. Is it a reinforcement learning agent? In one sense, yes, because if the user provides explicit feedback, for instance asking the chatbot to use simpler vocabulary, that message will remain in the conversation history and will affect the subsequent outputs generated by the LLM, a case of in-context learning. However, aside from this kind of explicit verbal feedback, this product/system is not generally processing signals about how well it is doing at achieving its goals, and so I’m inclined to say no, this is not a reinforcement learning agent.
The Payoff #
Now, let’s expand our view out a little more, and get to the point.
Typically, a product team will have an explicit goal to boost a “North star” metric like user engagement or retention, or maybe client satisfaction. Led by the product manager, the team will put in place ways to track this metric and other signals related to it, through a combination of analytics and, ideally, qualitative user research. They will, or at least should, also be recording actual conversations that users have with the chatbot, and regularly reviewing them for insights into how the user experience could be improved, and thus the North star metric boosted. (If they’re really good, they’ll also be running their chatbot through a constantly-evolving internal evaluation suite, but that’s a topic for another post.)
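As a toy illustration of what tracking such a metric might look like, here’s a sketch that computes week-over-week retention from a hypothetical event log. The schema and numbers are invented.

```python
# A toy sketch of tracking a North star metric: week-over-week user retention
# computed from a hypothetical event log. The schema and numbers are invented.
events = [
    {"user_id": "u1", "week": 1}, {"user_id": "u1", "week": 2},
    {"user_id": "u2", "week": 1},
    {"user_id": "u3", "week": 1}, {"user_id": "u3", "week": 2},
]

week1_users = {e["user_id"] for e in events if e["week"] == 1}
week2_users = {e["user_id"] for e in events if e["week"] == 2}
retention = len(week1_users & week2_users) / len(week1_users)
print(f"Week-over-week retention: {retention:.0%}")  # 67% in this toy data
```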
The product team might do a few things with these insights. They can change the UI, tweaking aspects of the product that don’t directly relate to the language model, but let’s not focus on that. They can make changes to how the retrieval system works. They can iterate on the system prompt, to change or more clearly articulate how they would like the language model to respond. They can switch to a different language model. All of these actions taken by the product team have the effect of updating the policy that the product agent follows: the rules that map its inputs to outputs.
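One way to make the “policy update” framing concrete is to picture the product’s configurable pieces as a single policy object that the team edits. The fields and values below are purely illustrative.

```python
from dataclasses import dataclass

# Illustrative only: the product's "policy" viewed as a single configuration
# object. Every change the team ships amounts to editing some part of it.
@dataclass
class ProductPolicy:
    system_prompt: str    # iterated on in response to transcript reviews
    model_name: str       # swapped when a different model works better
    retrieval_top_k: int  # tuned when retrieved context misses or overwhelms
    temperature: float    # adjusted to trade variety against consistency

policy_v1 = ProductPolicy(
    system_prompt="You are a friendly French tutor.",
    model_name="some-foundation-model",
    retrieval_top_k=3,
    temperature=0.7,
)

# After reviewing transcripts, the team "updates the policy":
policy_v2 = ProductPolicy(
    system_prompt="You are a friendly French tutor. Let learners find their own mistakes before correcting them.",
    model_name="some-foundation-model",
    retrieval_top_k=5,
    temperature=0.7,
)
```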
So, finally, we have a pretty interesting distributed agent here, made up of:
- A language model
- A product in which the language model is embedded
- A product development team, consisting of:
  - The people in the team
  - Their analytics and user research processes
The agent’s goal is to improve whatever the team has chosen as its North star metric, although of course, in a distributed system like this, other implicit goals will also play a role, such as not wanting to burn out through overwork, wanting to get on well with colleagues, and so on. The agent’s environment includes the users of the product, at any rate, although it’s easy to see how this analysis could get more complicated as we start taking other external factors that affect a product team (senior leadership changing priorities?) into account. Its inputs are all of the sources of user insights described above, and its outputs are the actions the team takes based on these insights.
Therefore, an AI product-development team, together with the product and model it builds around, constitutes a distributed reinforcement learning “agent”.
Appendix: Graduate Student Descent #
The spark for this whole meandering essay was the realisation that there’s an analogy between what I’ve described above (essentially policy optimisation via product manager critique) and the well-known idea of graduate student descent in deep learning (a play on gradient descent).
The crux of graduate student descent is this. When fitting machine learning models to data, there’s a difference between parameters, which are learned from the training data, and hyperparameters, additional settings which must be specified before training the model. For small, cheap-to-train models with a small number of hyperparameters, we can try fitting the model across a range of sensible hyperparameter values and pick the ones that work best, or use an optimisation algorithm to search for values that work well. For larger, expensive-to-train models, which also tend to have more hyperparameters, this is more difficult, because the number of possible values to try is larger, and the cost of trying out particular values by retraining the model is prohibitive.
Large AI labs have a solution. It turns out machine learning researchers are better than optimisation algorithms at identifying good hyperparameter values to try. Graduate students, the lowest-paid of these researchers, are typically the ones given the task of picking hyperparameters, waiting for the model to train, looking at the results, and picking the next hyperparameters. Thus, in the spirit of the analyses above, the machine learning model and the graduate student, together, constitute a more powerful learning system than the model alone.
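Schematically, the contrast looks something like this. The hyperparameter names and the `train_and_evaluate` function are stand-ins, not a real training setup.

```python
import itertools

# Schematic contrast between automated hyperparameter search and
# "graduate student descent". Everything here is a stand-in.

def train_and_evaluate(learning_rate: float, n_layers: int) -> float:
    """Stand-in for an expensive training run that returns a validation score."""
    return 1.0 - abs(learning_rate - 0.01) - abs(n_layers - 12) * 0.01

# Automated version: exhaustively try a grid (fine for small, cheap models).
grid = itertools.product([0.1, 0.01, 0.001], [6, 12, 24])
best = max(grid, key=lambda hp: train_and_evaluate(*hp))
print("Grid search picks:", best)

# Graduate student descent: a human looks at each result and picks the next
# values to try, so far fewer expensive training runs are needed.
# for _ in range(budget):
#     hp = graduate_student_picks_next(previous_results)   # hypothetical
#     previous_results.append((hp, train_and_evaluate(*hp)))
```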
Appendix: Generalising Gradient Descent #
On the topic of gradient descent, it’s also interesting to think about how that idea generalises to the kind of distributed, human-in-the-loop systems we’re discussing here. Very briefly, gradient descent is the idea in machine learning that you train your model by taking a single training data point, seeing what prediction your model currently makes for it (e.g. “60% probability that an image is a cat”) and comparing that to the actual outcome (e.g. the image is a cat) to get the prediction error. From this, you can infer how the prediction should be improved (in this case, the probability that this image is a cat should be higher), and so calculate whether each parameter should be increased or decreased to achieve that. By doing this repeatedly for all your data points, and only changing the parameter values a little bit for each point, you eventually converge on parameters that fit the training data well.
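Here’s a minimal numerical sketch of a single gradient step for that cat example, using logistic regression. The feature values and weights are invented for illustration.

```python
import math

# A minimal sketch of one gradient descent step for the cat example above,
# using logistic regression. Feature values and weights are invented.
features = [0.5, 1.2]   # features extracted from one training image
weights = [0.45, 0.15]
bias = 0.0
label = 1.0             # the image really is a cat

def predict(weights, bias, features):
    """Sigmoid of the weighted sum: the model's probability that this is a cat."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-logit))

before = predict(weights, bias, features)  # ~0.60, as in the example above

# The prediction error says the probability should be higher; the gradient
# says which direction (and how much) to nudge each parameter to achieve that.
error = before - label                     # negative: prediction too low
learning_rate = 0.1
weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
bias = bias - learning_rate * error

after = predict(weights, bias, features)   # ~0.63: a small step in the right direction
print(f"{before:.2f} -> {after:.2f}")
```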
This is standard practice in classical machine learning models, where the output is typically either a categorisation, or a predicted score on some scale, and so the prediction error is something that can easily be expressed in numbers. For generative text models like the GPT family, predicting the next token in a sequence is technically a categorisation task under the hood, where each of the ~200K possible tokens is a different category, and the task is to identify the category to which the text so far belongs.
But when you’re in a product team, looking at transcripts of conversations between users and your chatbot, what are the prediction errors? What are the gradients? Clearly, the prediction errors will be fuzzy, semantic things, which can’t easily be reduced to simple, tractable numbers. For example, you might identify that the chatbot is too quick to correct users’ mistakes, rather than walking them through a process to identify mistakes themselves. As a result, identifying the gradients — the changes that must be made to reduce this mismatch between intended behaviour and reality — is going to be an art rather than a science. Should you update the prompt? Where and how should you change it? Could this be addressed by changes elsewhere in the UX? We’re not at a point where these processes can be effectively automated, although the idea of automatically rewriting prompts in response to feedback is clearly an active area of research.