Jonathan Wall, founder and CEO of Runloop AI, discusses why agents require a departure from traditional server architecture. He explores how isolated computing environments unlock agent capabilities and why agent-native infrastructure is the next critical layer in the AI stack.

 
 
 


CRAIG:

And then talk about what Runloop is. There's some stuff in the materials that were sent over that was interesting to me. You talk about agents changing the shape of compute, and I'd like to hear all of that. But let's start with the introduction.

JONATHAN:

Okay, yeah. Hello, it's great to meet you, Craig. My name is Jonathan Wall. I am the founder of Runloop AI, and I suppose the CEO as well. I cut my teeth early in my career at Google. I joined Google not long after it went public, where I was the tech lead of the Google File System. Those were the seminal early days of huge, scalable distributed systems, so I had a front-row seat to an explosion of the internet being realized and new technology being built for it. Around 2009, another engineer and I went out and founded Google Wallet. That was the first NFC solution for tap-and-pay; when you see people tapping phones to pay, that was the technology we created in 2009. At the time I didn't think about it in as much detail, but it was picking up on the next big platform revolution in technology at the time, which was the mobile phone and the new capabilities that form factor enabled. So I went from the build-out of the internet and at-scale infrastructure to working on payments and mobile. From there, I left and started a company called Index with another gentleman from the Google Wallet effort, and we did secure payments infrastructure for tier-one retailers. That company was ultimately acquired by Stripe and became the Stripe Terminal product, Stripe's card-present product. So I spent a little bit of time at Stripe. After leaving Stripe, I saw the next big platform change, the next big technology revolution, in AI, and thought about how I wanted to participate in this particular revolution. Given that a lot of my background is in infrastructure, I thought it would be interesting to apply the lens of infrastructure to the coming agentic revolution. I got together with some former teammates from my previous startup and hired some new folks to start Runloop. We're really trying to build a platform as a service to be the place where you run these agents. And much like the infrastructure build-out of GFS back in the day was necessary for the internet, we think that agents work differently enough and have different enough compute patterns that they need their own new compute primitive. That is core to what we're building here at Runloop. We call that our dev box.

CRAIG:

Yeah. And there's been this kind of explosion of meta platforms that sit under agents, on top of foundation models, to orchestrate and have agents talk to each other or talk to models and all of that. I'm familiar with Boomi, I don't know if you know them. And there's another company I just had a conversation with yesterday called DevRev, and they have a product called Computer. A lot of these have a platform where you can build agents, then manage agents, then have agents work with each other. How does your company fit into that emerging infrastructure ecosystem?

JONATHAN:

Yeah. So you're pointing out that there are a few different layers to this stack. There are some folks that make agent builders, and that is super useful and super necessary. Some of them are paired to different frameworks; the folks at LangChain have a cool agent builder that is tightly coupled to LangChain, for example. But where we're playing is a little lower, a little closer to the infrastructure layer. We ultimately want to be the runtime where the agent is deployed and where it executes. If you contrast the needs of an agent with those of a traditional server: traditional servers get a request, it's usually schematized, right? People use REST and JSON, or maybe gRPC and protobufs. The server executes some deterministic code and usually stores some state in a database that is also strictly schematized. The way agents work is a lot different. Agents need access to a lot of tools, they need access to context to perform their task, and they have unpredictable behaviors compared to traditional deterministic code. They might use a lot of resources, they might write code for themselves and need to execute it. What you really need for these agents is a runtime, a place for the agents to execute where that stuff is safe, and where you are also able to empower the agents with all the tools they need. We think the dev box, which is effectively an execution sandbox, more or less giving each agent its own computer, is the right pattern for this. We can set up the sandbox with security boundaries and isolation so that if the agent happens to do something unexpected or potentially dangerous, it's isolated in its own sandbox environment. That's one side of the coin: hey, I want to prevent the agent from accidentally doing something bad. The flip side of the coin is that agents perform better when you give them tools, and if I give this agent its own computer, that is effectively the most powerful tool. By having its own sandbox, it can go to the internet, it can download documents, it can learn things it needs, it can parse documents, it can keep track of its own process. Not to anthropomorphize it, but when you're working, you probably have a notepad next to your laptop. You have your own computer, you can go find information, you can keep notes about what you're trying to accomplish, maybe a to-do list you check off. We think that in this manner agents work a lot more like people, and they really need their own computer to succeed. That's the compute primitive we're trying to bring to market.
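
To make the dev box idea concrete, here is a minimal sketch of the pattern Jonathan describes: each agent task gets its own isolated sandbox, runs inside it, and the sandbox is torn down afterwards. The SandboxClient, create_devbox, exec, and shutdown names are hypothetical placeholders for illustration, not Runloop's actual SDK.

```python
# Hypothetical sketch: give each agent task its own isolated sandbox.
# SandboxClient, create_devbox, exec, and shutdown are illustrative names,
# not the real Runloop SDK.
from dataclasses import dataclass

@dataclass
class Devbox:
    devbox_id: str
    image: str

class SandboxClient:
    """Stand-in for an agent-infrastructure client (placeholder implementation)."""

    def create_devbox(self, image: str, cpu: int = 2, memory_gb: int = 4) -> Devbox:
        # In a real platform this would boot a container inside a micro-VM
        # with network and filesystem isolation; here we just return a handle.
        return Devbox(devbox_id="dbx_example", image=image)

    def exec(self, devbox: Devbox, command: str) -> str:
        # Run a shell command inside the sandbox and return its output.
        return f"(ran `{command}` inside {devbox.devbox_id})"

    def shutdown(self, devbox: Devbox) -> None:
        # Tear the sandbox down once the agent signals it is done.
        pass

def run_agent_task(client: SandboxClient, task: str) -> str:
    box = client.create_devbox(image="my-agent:latest")
    try:
        # The agent is free to use the whole machine: fetch docs, write notes,
        # run scripts. Anything risky stays inside the sandbox boundary.
        return client.exec(box, f"agent --task '{task}'")
    finally:
        client.shutdown(box)

print(run_agent_task(SandboxClient(), "summarize the repo"))
```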

CRAIG:

And so the agent operates in its own virtual machine, not necessarily a web page, not necessarily a browser, or is it a browser?

JONATHAN:

You're correct, it gets its own virtual machine. We let people define container images; containers, popularized by Docker, are effectively the de facto way to define a compute environment these days. So we launch a container, but we put it inside of a micro virtual machine so that you can tightly control what it can and can't access and make sure to keep it isolated. But you're correct to observe, it's not a web page, it's not a browser. It has full access to a computer of its own, and that lets it do really powerful things. I don't know how much you've played with some of these agents like Claude Code or Codex or LangChain Deep Agents, but you'll see they routinely use the shell, the terminal environment of their computer. That's really taking advantage of 50 years of Unix programming that is at their fingertips to dynamically get things done how they see fit. We think this is pretty important.
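
As an illustration of the shell access Jonathan mentions, here is a minimal sketch of a shell-execution tool an agent harness could expose inside its sandbox. The function name and return shape are generic assumptions, not the API of Claude Code, Codex, or Deep Agents.

```python
# Minimal sketch of a shell tool an agent could call inside its own sandbox.
# The function name and return format are illustrative, not any specific SDK's API.
import subprocess

def run_shell(command: str, timeout_s: int = 60) -> dict:
    """Run a shell command and return stdout, stderr, and the exit code."""
    result = subprocess.run(
        command,
        shell=True,            # acceptable here: the whole sandboxed VM is the blast radius
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout[-4000:],   # truncate so it fits in model context
        "stderr": result.stderr[-4000:],
    }

# Example: the agent leans on standard Unix tooling to inspect a repository.
print(run_shell("ls -la && git log --oneline -5 || true"))
```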

CRAIG:

And let me ask, so that's for an individual agent, and I apologize for bringing these other companies into it, but it's a way of me triangulating where we are with Runloop. Manus, for example, the Chinese agent platform, was the first I saw that's using a virtual machine, but it's using a browser. Is this a trend of having the agent operate in the cloud somewhere, on separate compute infrastructure, rather than the agent using resources on the user's computer?

JONATHAN:

Yeah, for sure. I think there are many reasons for that. And just to clarify, with Manus, I believe they actually run the agent inside its own virtual machine, and then they have a web page that is just the product surface area for how you use it. I believe each of those agents does get its own virtual machine. And yeah, I think this is the kind of pattern that people are going to move towards. You have a nice isolated unit of execution, meaning: here's your container image, you run in a virtual machine, I'm not worried about you getting out and causing trouble. But also, I can put anything into the container image you need; I can make sure you have context and all the right tools to get your job done. By and large, what we'll start to see, or really what we already are starting to see, is people having some nice web-based UI that then routes to these agents running inside their own virtual machines. And I do think it is useful that you bring up these other companies as examples, because you made the point that you're seeing a lot of companies popping up at different layers of the stack. There are people above us in the stack, like agent builders and different agent frameworks. There are also people below us in the stack. We predominantly run on AWS and GCP, but you can look at people like Oracle and CoreWeave saying, hey, there's enough of a shift in computing happening here that bundling GPUs is now really important too, and they're trying to build new cloud offerings that go head-to-head with AWS and GCP. So across the board, the status quo has been thrown up in the air, and at all these different levels of the stack you have different people trying to provide new primitives to build towards this new economy.

CRAIG:

Yeah. This layer that you guys are building, it's like taking the Manus concept, and not that you took Manus's concept, but in my mind, creating an agnostic infrastructure layer that all these agent builders can then operate on top of. Is that right?

JONATHAN:

That's a great way to put it. We believe that in the future, which is happening now, you'll see more and more people building agents and saying, hey, I will have a product front end, and the way for me to deploy my agent is to spin each one up with its own computer. Our platform is designed to satisfy that use case and make it as easy as possible. We make it really easy to specify what you want in the container image, to launch these things, and to connect to them. We have some of our customers launch like 6,000 of these things at a time. And in terms of connecting to the agent, so that your product front end can actually pipe all the way in and interact with it, we support connectivity capabilities. We have a product feature called tunnels, where you can connect to these things via WebSockets or HTTP. You can tell us to open tunnels; by default, we don't open a tunnel unless you ask us to. But yeah, we think that this is an important new primitive, and we're excited to be a part of it.
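
Here is a rough sketch of what connecting a product front end to an agent through a tunnel might look like, assuming a plain HTTP interface; the tunnel URL and the /message endpoint are invented for illustration and are not Runloop's actual API.

```python
# Hypothetical sketch: a product backend talking to an agent over an opened tunnel.
# The tunnel URL and /message endpoint are illustrative placeholders.
import json
import urllib.request

def send_to_agent(tunnel_url: str, text: str) -> dict:
    """POST a user message to the agent running inside its devbox."""
    payload = json.dumps({"message": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{tunnel_url}/message",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Usage (assuming a tunnel was explicitly opened for this devbox):
# reply = send_to_agent("https://dbx-example.tunnels.example.dev", "status?")
```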

CRAIG:

Yeah. And so you're not building agents yourselves. Is that right?

JONATHAN:

No, it's really up to our users to build their own agent. Maybe let's walk through the life cycle of how someone goes about doing this. There's another trend worth mentioning that's a little orthogonal to us but I think somewhat important, which is that people are now starting to move towards agent harnesses. You have the Claude Code folks, who have produced the Claude Agent SDK; OpenAI has produced the Codex SDK; and the great folks at LangChain, who we've been fortunate to partner with on a few things, have produced a harness called Deep Agents. These harnesses are very capable agent building blocks that you can build on top of and customize for your particular use case. So I think a very natural way for people to build agents going forward is to choose one of those harnesses, include it as a dependency in their agent, code it locally, and say, okay, great, this agent's good to go. You then need a way to deploy it. You say, okay, I need a cloud deployment, I'll use Runloop, hopefully. You use our APIs to deploy your agent. Maybe you build a front end on a company like Vercel or a company like Railway that makes it really easy to build web front ends. And then whenever someone comes to your site and says, hey, I want to talk to an agent, it makes an API call over, it starts a dev box on our system for that one agent, and you connect the agent to the front end and off you go. I believe that's effectively how Manus works. I obviously haven't seen their source code, and I don't think it's open source, but I think that is the product pattern people are going to start building against.

CRAIG:

Yeah. You mentioned someone spinning up 6,000 agents, and I'll get to my more Runloop-focused questions in a minute, but I've wanted, not always, obviously, but in the last couple of months, to go in and sit with somebody managing a thousand or six thousand agents. What does that look like? What are those agents doing? Because, just from talking to people, I'm getting the impression on the enterprise side that there isn't yet the level of trust or reliability in agents to get 6,000 running at once, unless it's 6,000 employees monitoring their email boxes or something very basic. So in that 6,000 example, regardless of whether that's a hard number, how effective are these agents that you see people building, and how varied are the activities they're involved in?

JONATHAN:

Yeah, these are very important questions. In terms of the varied activities: when we started our company, we focused a lot on the coding vertical, so a lot of the companies built on top of us are people building coding agents. There might be coding agents that attempt to discover security vulnerabilities, or attempt to reduce technical debt in your code base by writing unit tests and upgrading dependencies and things like that. So there's a wide array of companies of that sort. Increasingly, we're starting to see people building financial technology agents as well, where they're trying to solve finance-type problems: given some set of documents from an S-1 filing or something, can I extract useful information? So it is pretty open-ended, it is pretty broad; there are some folks in crypto, some people in health tech. There's a widening dispersion of use cases on our platform. But you raised another really important point that I think the industry is still getting its hands wrapped around, which is accuracy, being confident that your agent is producing useful output. At the core of our platform is this notion of a dev box, which is effectively an isolated execution environment for an agent to do stuff. We built a product on top of that called benchmarks, and the purpose of benchmarks is to help people with the accuracy problem. Let's say one of our customers is building a security agent, for the sake of argument. You create a benchmark based on a dev box in a known state with a known desired outcome. Meaning: given a dev box that started in this state with a code base, I know ahead of time there are two or three vulnerabilities. Let me run my agent against this thing again and again, and let me verify that it always finds those vulnerabilities. So the benchmarking solution we have lets our customers build domain-specific tests to verify their agents are routinely succeeding at the types of things they're trying to get them to do in the wild, in live production traffic. And I think it's a pretty important part of the development life cycle, because, hey, Anthropic just dropped Claude Opus 4.5. Presumably it's better; all the benchmarks they release seem to indicate it's better. But how do you know that if you switch the model your agent is talking to, it will perform better? If you've been running these benchmarks as part of your production life cycle, you can change which model you're using and see if your scores get better or, who knows, they might get worse. So we think benchmarking is actually very important just for normal development as well. Even if you were to do something as simple as changing the prompts for your agent to make it friendlier, maybe someone said the agent's a little abrupt, let's make it a little friendlier, you would think that encouraging the agent to be friendly wouldn't change its underlying behaviors, but you don't know, right? What you're doing is taking your agent's entry point into a large stochastic model and perturbing it. You're shifting it, you're moving around how it enters that large stochastic model, and you don't really know what the side effects are going to be. So we do think the benchmarks are very important.
And I would say that some of our more advanced practitioners on the platform use things like the benchmarks, but they also do application-layer things to try to improve accuracy. They'll have an agent that tries to solve a task and says when it thinks it's done, and then another agent that runs after it, tries to verify it actually did its job, and tells it to start over if it thinks it was wrong. So there are a number of ways people can push towards accuracy, but we think having something like benchmarks and constantly measuring your accuracy is a key point.
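
A minimal sketch of the benchmark idea Jonathan describes, assuming hypothetical names: a scenario with a known starting state and known planted findings, a simple scoring function, and repeated runs of the agent against it.

```python
# Minimal sketch of a domain-specific benchmark: a known starting state,
# known expected findings, and a scoring function. Names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class BenchmarkScenario:
    name: str
    snapshot_image: str              # devbox image capturing the known state
    expected_findings: Set[str]      # e.g. the two or three vulnerabilities planted in the repo

def score(expected: Set[str], reported: Set[str]) -> float:
    """Fraction of planted issues the agent actually found (simple recall)."""
    return len(expected & reported) / len(expected) if expected else 1.0

def run_benchmark(
    scenario: BenchmarkScenario,
    run_agent: Callable[[str], Set[str]],   # launches the agent against the snapshot
    trials: int = 5,
) -> float:
    scores = []
    for _ in range(trials):
        reported = run_agent(scenario.snapshot_image)
        scores.append(score(scenario.expected_findings, reported))
    return sum(scores) / len(scores)

# Usage: swap the model or prompt, rerun, and compare the average score.
demo = BenchmarkScenario("security-repo-1", "snapshots/repo-v1", {"CVE-A", "CVE-B"})
print(run_benchmark(demo, run_agent=lambda img: {"CVE-A"}))
```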

CRAIG:

Yeah, that makes a lot of sense. And again, going back to this society of agents, as somebody I spoke to called it, where you have a thousand agents in an enterprise, or 6,000 agents, doing stuff. Is your sense that most of those agents are on call for employees to use, like a coding agent, not running on their own and performing tasks in the enterprise? Or are there agents that do run continually on their own, maybe watching for expense reports to come in and then handling them?

JONATHAN:

Yeah, it's a great question. I think out there you see basically a mix of everything. The human-in-the-loop, human-driven, collaborative thing is a very popular pattern. You see that with Claude Code and Codex and Gemini CLI, where you probably have a human that's really driving the work input and evaluating the work result, but the agent is working at the behest of a person. But similarly, as you were suggesting, there are all kinds of systems out there that throw off events. A ticketing system can fire an event: most ticketing systems, like Zendesk, will email the support team when someone files a ticket. That can easily be a trigger that invokes an agent and gets the agent to try to resolve the event as well. Some of our customers have agents that fire every time you push a change to GitHub: it fires a request, starts an agent, and has the agent review your code and see if there are any obvious mistakes. We also have customers that hang their agents off of Slack, so you can say, at the agent's name, hey, can you help me with some problem? That's yet another source for a trigger. I guess that one's a little more human-driven, but I think you're really seeing a blend of everything, and I think it'll probably continue that way.
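
A small sketch of the trigger pattern described here, with placeholder names: an external event (a ticket, a Slack mention, a git push) arrives and is routed to an agent run on the platform.

```python
# Hypothetical sketch: an event (ticket filed, Slack mention, git push) triggers
# an agent run. The start_agent_run call stands in for a platform API.
def start_agent_run(agent: str, context: dict) -> str:
    """Placeholder for 'spin up a devbox and launch this agent with this context'."""
    return f"run-{agent}-{context.get('id', 'unknown')}"

def handle_event(event: dict) -> str:
    # Route different triggers to different agents.
    if event["type"] == "ticket.created":
        return start_agent_run("support-resolver", {"id": event["ticket_id"]})
    if event["type"] == "slack.mention":
        return start_agent_run("helper", {"id": event["thread_ts"], "text": event["text"]})
    if event["type"] == "git.push":
        return start_agent_run("code-reviewer", {"id": event["commit"]})
    return "ignored"

print(handle_event({"type": "ticket.created", "ticket_id": "ZD-1234"}))
```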

CRAIG:

Yeah. And your benchmarking layer is really important to make sure that the agents are performing accurately. But presuming that accuracy improves, do you see a day when agents are really the underlying fabric of the enterprise where you have them handling all the back office stuff, all the quality assurance, whatever, depending on your domain?

JONATHAN:

Yeah, I think the answer is yes, but it'll be a per-workflow sort of thing. My guess is that what will happen first is you'll have agents do 70 or 80 percent of specific types of workflows, and you'll probably still have a human being that audits or approves the final result. An interesting observation from the coding space: if we set aside agents and AI for the moment, when I go to write some code, I have in my mind, or maybe someone filed a ticket and assigned it to me, some sort of change I want to make. I go make the change, I make sure it builds and passes all our tests, but then I assign it to one of my peers to review. So someone else on my team says, okay, John thinks he solved this problem, let me look at his code. Ultimately that other person approves it, and then the code gets landed and merged. I think you might see those types of patterns, but where instead of me writing the code or performing some action, it's maybe an agent that does most of it. Then there's a reviewer that says, okay, I trust this agent mostly, but I'm going to look at the conclusions it made, I'm going to look at how it did this work, maybe I'll look at the input requirements, and then I'll approve it. And I think this is an interesting component that's going to need to be added to a lot of enterprise workflows, this review-and-accept stage. It's very natural in the coding tooling ecosystem, but it's going to need to be added to a lot of the broader enterprise ecosystems, where maybe a Zendesk ticket comes in, an agent reads the ticket, looks at some knowledge base, does a bunch of work, and proposes updating the customer's record in Salesforce. But before it gets to do that, a human being says yes or no to the change. So I do think this is coming, but I think it'll be selective workflows first, and probably human-audited a lot of the time. Eventually there will be workflows where you have such high confidence that instead of the human reviewing everything, they only review 10 percent of things, or spot check, something like that.
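
A tiny sketch of the review-and-accept stage Jonathan describes, with hypothetical names: the agent proposes a change, and a human approves or rejects it before anything is applied.

```python
# Tiny sketch of a review-and-accept gate for agent output (hypothetical names).
from dataclasses import dataclass

@dataclass
class Proposal:
    summary: str
    diff: str            # or a proposed Salesforce update, ticket reply, etc.

def human_review(proposal: Proposal) -> bool:
    """Stand-in for a UI approval step; here it's a terminal prompt."""
    answer = input(f"Apply change? {proposal.summary} [y/N] ")
    return answer.strip().lower() == "y"

def apply_change(proposal: Proposal) -> None:
    print(f"Applied: {proposal.summary}")

def review_and_accept(proposal: Proposal) -> None:
    # Only approved proposals ever touch the system of record.
    if human_review(proposal):
        apply_change(proposal)
    else:
        print("Rejected; sending back to the agent with feedback.")
```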

CRAIG:

Yeah, and most enterprises that exist today are trying to adopt agents simply because of the competitive pressures. But there are other startups in industries where there are legacy players that are dominant. Do you see agent native companies coming up? And is it possible to challenge a legacy player with a predominantly agentic workforce, do you think?

JONATHAN:

I think so, and I think it'll just become increasingly easier.

CRAIG:

Yeah, it's amazing. So, in the way the internet forced a reshaping of the economic landscape, with those that adopted it early figuring out how to use it, and some startups coming in and challenging legacy players, some big players faded away and new players arrived. Do you think we'll see that kind of churn in the economy?

JONATHAN:

I would guess so, yeah. An interesting place to look at this is maybe the Series C and D companies that were born before AI, and how rapidly they're adopting it. Look at a company like Notion or a company like Figma: both of these companies predate AI agents really being a thing, and both were challenging larger incumbents, but being still rather young and nimble, they have supplemented their existing products with AI. Google is doing this all across the G Suite stuff, and Microsoft is trying to do this with Copilots everywhere. So there's a bit of a foot race here for the incumbents to say, okay, how can I supplement and improve my product with agents to fend off that challenge? Versus new startups that are more purely agent-first, that can grow rapidly and maybe shake things up a little. Cursor would be an example of this: they were born of the AI era, they have a really cool IDE that is agentic-first, and they're making some waves, growing quite quickly.

CRAIG:

Yeah. Can you walk us through, you did a little bit before, but a concrete example of an agent job on Runloop, from spin-up to running tools to shutdown, so people can picture what the platform actually does under the hood?

JONATHAN:

Yeah, sure. So maybe I'll give an example of someone who has, I don't know, a code review tool. The important piece to call out is that there's still a product layer for these companies that exists outside of Runloop; Runloop is really where the agents are running. So let's say that you are starting tomorrow and you're going to build CRAIG Review, your code review agent from Craig. You might use the Claude Agent SDK or LangChain Deep Agents or the Codex SDK, or you might just hand-roll your own agent, who knows, I don't know how ambitious you're feeling. Once you've built that agent, you're going to get it ready to run on Runloop. We have an agent API that lets you upload your agent and store it on our platform, or you can bake it into a container image so it's just part of the image. Now that you have built your agent and figured out how to deploy it to the Runloop runtime, anytime you want, you can spin up a dev box and point it at a code repository and a review you want it to annotate: this is good code, this is bad code, you should do this other thing. But now you need to build the product front end for your product. Let's say that you're also a fan of, for the sake of argument, Vercel and Next.js. You might build your web front end where people go sign up and put in their credit card or whatever, and then they give you authorization to receive webhooks from GitHub. Then every time someone opens a pull request on a GitHub repository, you get a webhook at the product layer. You receive that webhook and say, great, I'm going to stand up a dev box on Runloop, tell it to mount this repository at that revision, and invoke my agent: hey, is this good or bad? Provide feedback. You might choose to give the agent access to push its updates to GitHub directly; you probably would, so it can comment on the pull request. The agent will go about its business, it will comment on the pull request, and at some point it will exit and signal that it's done. Your product layer, which is presumably talking to the agent, sees it pushed its updates and is done, and shuts down the dev box. Now you have a front end that has user sign-up and some UX for the user, how many pull requests have I processed in the last hour, things like that. You have your front end and your product experience, but when it comes to actually launching and executing the agents, you've done that on top of Runloop.
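
Putting the walkthrough together, here is a hedged end-to-end sketch of the hypothetical CRAIG Review flow: a GitHub pull-request webhook arrives at the product layer, a dev box is created for the review, the agent runs inside it, and the dev box is shut down. Every call here is an illustrative placeholder, not Runloop's actual API.

```python
# Hypothetical end-to-end sketch of the "CRAIG Review" flow described above.
# All client calls are illustrative placeholders, not Runloop's actual API.
def create_devbox(image: str) -> str:
    return "dbx_review_123"                      # placeholder devbox id

def exec_in_devbox(devbox_id: str, command: str) -> int:
    print(f"[{devbox_id}] $ {command}")
    return 0                                     # pretend the command succeeded

def shutdown_devbox(devbox_id: str) -> None:
    print(f"[{devbox_id}] shut down")

def handle_pull_request_webhook(payload: dict) -> None:
    repo = payload["repository"]["clone_url"]
    pr_number = payload["pull_request"]["number"]

    devbox_id = create_devbox(image="craig-review-agent:latest")
    try:
        # Mount the repository at the revision under review, then invoke the agent.
        exec_in_devbox(devbox_id, f"git clone {repo} repo && cd repo && git fetch origin pull/{pr_number}/head")
        exit_code = exec_in_devbox(devbox_id, f"review-agent --pr {pr_number} --post-comments")
        print("review finished" if exit_code == 0 else "review failed")
    finally:
        shutdown_devbox(devbox_id)               # always reclaim the sandbox

handle_pull_request_webhook({
    "repository": {"clone_url": "https://github.com/example/repo.git"},
    "pull_request": {"number": 42},
})
```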

CRAIG:

Yeah, that's amazing. In some of the stuff I've read, you say that agents change the shape of compute. Can you explain what you mean by that, and the biggest ways agent workloads differ from classic LLM API calls?

JONATHAN:

Yeah, I think we touched on this a little earlier, but it's really worth diving into. If you think about a traditional server, it has a structured API where it's getting schematized data, and it's storing output in a database in a deterministic fashion. That's how most people run their services these days. When you launch an agent and ask it to do something, it's very open-ended, much more like how a human being solves a problem. Let's go back to our code review agent. Say it sees that you're making a change, you're adding a GitHub workflow. The agent might say, I don't know if the way the author is writing this GitHub workflow is legal; let me go download all the documentation from GitHub.com on workflows, parse that document, read it, understand it, maybe write some intermediary notes for myself. That's very different from a server getting a request and writing something to a database like a traditional compute pattern. And there are the negatives, maybe negative is not the right word, but there are security and isolation reasons why you want to run inside a sandbox: if the agent were to accidentally download something malicious, you don't want it to infect your organization. That's part of the reason why you want an isolated sandbox. But I think the more positive or constructive rationale is that giving agents tools makes them more effective, and there is no more powerful tool than a computer. Saying you have full access to networking, the local file system, even bash: hey, if you want to write a script to make sure the code in this pull request is parsable, write a script, run it, no problem, you have your own computer. So that's really what we're talking about here: agents have unpredictable CPU and memory utilization, they might do things that are a little dicey or dangerous and need to be isolated, but ultimately a computer is the most powerful tool you can give to an agent. And standing up open-ended compute instances like this on a task-by-task basis is a little unusual. That's why we see this sandbox as a new compute primitive, and one that isn't there yet; there is no elastic sandbox in AWS at the moment, nor GCP. We think that within a few years this will be the status quo pattern for running agents.

CRAIG:

Yeah, and when you say elastic, meaning you can spin up as many as you want, or you can put as much as you want into the container?

JONATHAN:

That was a little tongue-in-cheek about how AWS names things: elastic this, elastic that, elastic the other thing. But it is also succinct and correct, and that is a feature of a platform that you want. You want to be able to stand up a bunch of these instances, throw them out when you're done, scale it up, scale it down. So it is an important feature. Maybe it was both tongue-in-cheek and accurate.

CRAIG:

Yeah, okay. And in your case studies, you talk about cutting reinforcement learning and fine-tuning cycles from weeks to hours. What did customers actually change in their workflow once they had Runloop and the built-in benchmarking tool to reduce that RL and fine-tuning time?

JONATHAN:

Yeah, so when you're doing an RFT, you're effectively trying to solve the same problem many times with slight variations of your model. You're trying to figure out: if I tune the model a little this way or that way, if I effectively adjust my weights, does it improve the model's ability to solve the problem? In the particular case study you're talking about, by having this property of being able to run many dev boxes at once, we made it possible for our partner to try many different model variations at the same time. Our benchmarking product lets you start in a known state, have your agent try to solve a problem, and then apply a scoring function. So when our partner was doing this RFT work with us, they would launch many dev boxes, each pinned to a different model variant, and then use the scoring function to ascertain whether a model variant was good or bad, whether it scored better. In a way you're attempting to find the optimal weights of the model for a specific problem. It's really our ability to run many dev boxes in parallel, and to run the scoring function and return the result, that helped there.
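
A sketch of the parallel evaluation loop behind that RFT workflow, under the assumption of a hypothetical run_benchmark_for_variant helper: several model variants are scored against the same benchmark at once, and the best-scoring variant is kept.

```python
# Sketch of the parallel-evaluation loop behind the RFT case study: try several
# model variants at once, each against the same benchmark, and compare scores.
# run_benchmark_for_variant is a hypothetical placeholder.
from concurrent.futures import ThreadPoolExecutor
import random

def run_benchmark_for_variant(variant: str) -> float:
    """Placeholder: launch a devbox pinned to this model variant, run the
    benchmark scenario, apply the scoring function, and return the score."""
    return round(random.uniform(0.0, 1.0), 3)

variants = ["base", "lr-1e-5-step-200", "lr-1e-5-step-400", "lr-3e-5-step-200"]

with ThreadPoolExecutor(max_workers=len(variants)) as pool:
    scores = dict(zip(variants, pool.map(run_benchmark_for_variant, variants)))

best = max(scores, key=scores.get)
print(scores)
print(f"best variant this round: {best}")
```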

CRAIG:

Yeah, that's interesting. Because that's a whole use case in itself, right? Otherwise, how would someone optimize RL fine-tunes?

JONATHAN:

Yeah, they would have to build some alternate framework or do a lot of the stuff that we do out of the box; they'd have to figure it out on their own. We also did a pretty cool SFT project with OpenAI. SFT, supervised fine-tuning, is a little bit easier in that you run against something like a benchmark and treat the scoring result as a data labeler. You say, okay, here are all the times I attempted to solve the problem and got a zero or a low score, and here are all the times I attempted to solve the problem and got a high score. You can gather that labeled data, and that's a supervised set. It's almost like getting a bunch of exams from the last few years before you take your final, seeing the correct answers and the wrong answers, and learning from that. SFT is a little bit easier in that you can passively gather this data by running against benchmarks and then use it to produce a LoRA on an LLM that is tuned with knowledge of your specific problem set. It's not quite as powerful as RFT, but it's a bit easier to do, at least in my experience.
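
A minimal sketch of the SFT data-collection idea, with hypothetical structures: benchmark runs carry a score from the scoring function, and only high-scoring attempts are kept as supervised training examples.

```python
# Sketch of using benchmark runs as a passive data labeler for SFT: keep the
# high-scoring attempts as supervised examples. Structures are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkRun:
    prompt: str          # the task the agent was given
    transcript: str      # what the agent did / produced
    score: float         # scoring-function result, e.g. 0.0 to 1.0

def collect_sft_examples(runs: List[BenchmarkRun], threshold: float = 0.9) -> List[dict]:
    """Turn high-scoring runs into (prompt, completion) pairs for fine-tuning."""
    return [
        {"prompt": r.prompt, "completion": r.transcript}
        for r in runs
        if r.score >= threshold
    ]

runs = [
    BenchmarkRun("fix the failing test", "...patched test, all green...", 1.0),
    BenchmarkRun("fix the failing test", "...gave up after three tries...", 0.0),
]
print(collect_sft_examples(runs))
```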

CRAIG:

And how costly is doing that? If you want to spin up, I don't know how many dev boxes we're talking about, but I suppose it depends on how many you want to run.

JONATHAN:

I think the SFT is probably an order of magnitude less expensive than the RFT, and the SFT is much easier. I'll freelance a little here with my opinions. I think in general, most folks that are building agents should use benchmarks, tune their agents, and pay attention to picking up the newest models, making sure they're on the latest and greatest. Once you've hit a point where you have a large, concentrated volume of traffic, maybe it's time to consider doing an SFT. And in terms of complexity and cost, it's probably worth doing RFTs for the most high-leverage, high-business-impact, high-throughput use cases. So it's really a matter of how much effort versus how highly utilized the improvement will be. Those would be my personal rules of thumb; I'm sure there are other people out there who have slightly different opinions or would do things differently, but that would be my rule of thumb.

CRAIG:

And so you're building this. Who is the target market? Because somebody like Google, OpenAI, AWS, Microsoft, they presumably have infrastructure built in their systems to do this stuff.

JONATHAN:

To varying degrees, yes. But these companies are so large and have so many engineers that, yes, they can figure this, or something like it, out.

CRAIG:

But for enterprises, this is not for people like me who want an agent to do one or two things and can get off-the-shelf solutions. This is for complex enterprises that want to be serious about adopting agentic workflows and have a team of dedicated engineers who are going to be identifying business cases, building agents, and deploying them, and who need that infrastructure layer. Am I getting that right?

JONATHAN:

Yeah, you are. I would say that when we first started, we were mostly servicing Series A-type companies, people out there trying to build agents as a new business. Over time, we've moved up the value chain, and we're starting to have Series B, C, and D companies adopting our platform. A push for us going into next year is adding the capability of what we call deploy to VPC, which means being able to take our solution and deploy it into another enterprise's cloud. We think this will be really important for unlocking the larger enterprises. Many bigger companies are interested in this, but for compliance, security, and data privacy reasons, they're a little scared about using a shared public cloud platform such as ours. If we can take our product, shrink-wrap it, and drop it into their environment, they get to enjoy all the benefits while having the peace of mind and the compliance posture they need to have in order. So it's an important focus for us; I think next year we'll be moving towards that enterprise customer.

CRAIG:

Yeah, I see. And when you say Series A, B, C, you're talking about companies that are building agentic products, or companies that want to build an agentic enterprise to support whatever product they're presenting to the market, not necessarily an agent product?

JONATHAN:

Yeah, it could be either. A lot of the Series A, earlier-stage companies are AI-native, newer companies that are trying to purely build products that are agents. Some of the more mature, later-stage companies are saying, hey, I have an existing product, it's successful, and I want to supplement it by adding a little bit of agentic magic to it.

CRAIG:

Yeah. As these systems develop, and I think the jury's still out on whether five years from now we'll be living in an agentic economy or whether agents will become useful in the enterprise but not transformative, how do you see workforces changing? Because if agents are the future, it's going to have a big impact on how workforces in the enterprise are structured.

JONATHAN:

Yeah, it already does. Our company is only 16 people, and we're almost all engineers. Every one of us uses coding agents to help us be more productive. And what's interesting is different people use different ones. Some people really like Gemini, some people really like Claude, some people roll their own agents for particular use cases that are very common; I've actually built some of my own agents to evaluate things. This is already happening. What's the saying? The future's here, it's just not evenly distributed. Engineering was the first place this really took hold, so there are probably not many engineers out there who aren't using these things, unless for some reason their workplace won't let them. And I think the point I just made from my own experience is a valid one. Remove the engineering label: if you're someone in HR or some administrative part of an organization, you're probably going to have a few different agents, and you'll probably sit next to someone who has a pretty similar job, who uses a different agent and works with it in a slightly different way, but gets more done because they're working with it. You might choose a different agent and have a slightly different workflow that you're happy with. So I think these things are going to end up being part of the workforce, part coworker. And different teams will use different things too. If you think about an engineering organization, it's not flat: there are people who focus on the front end, and there are agents that are great for the front end; there are infrastructure people who maybe use different things; there are security engineers and ops people, and there are a number of agents specific to ops. So I think about it as having new coworkers, and you can choose which of your new coworkers you're going to collaborate with depending on what you're trying to do. That'll be my guess.

CRAIG:

Yeah, that's actually really interesting. I hadn't thought about that: that when you're using an agent, someone with a very similar job may be using a different agent, or using the agent in a different way.

JONATHAN:

Our desks are lined up, and I have a person to my left and right, and obviously people behind and so on, and no two of us use these things the same way.

CRAIG:

Yeah, that's interesting, because I always assumed this would be some higher-up corporate policy: these agents are going to do this, and the employee will have to work with the agent to do that. But in fact, these agents are available, they do things, and the individual employee figures out how best to use them for their workflow.

JONATHAN:

Yeah, it comes down to personal style. I'm sure in your industry you have a particular way you like to go about researching things and choosing how to structure your writing, and I bet you have colleagues, peers of similar stature, who have a totally different process but arrive at an equally valuable result in a totally different way. For me, when I work with agents, I know what it is I want to do. I like to sketch out the bigger code changes and make sure they look right to me, and then say, all right, agent, fill in the blanks here and here. Along the way, I'm reviewing the code as it's writing it. Sometimes it finds something I didn't foresee: wait, you set it up like this, but how is this other thing going to work? And I'll be like, whoops. Other people work totally differently with agents. I think it's interesting.

CRAIG:

Yeah, so with Runloop you're moving into the large enterprise space once you have that deploy-to-VPC product.

JONATHAN:

Yep.

CRAIG:

Yeah. How is the company doing? It's amazing, only 16 people. How do you see the future? Do you see this as wide open, sky's the limit, or do you feel you're not sure what direction the market's going to take next, that you're placing your bets and a different paradigm may develop?

JONATHAN:

I don't know how long the enterprise piece is going to take. We certainly are continuing to support the monolithic deployment we have right now, where all our customers run, and we will continue to try to grow our business there, onboard more customers, and add more capabilities. I do think that for the broader AI and agentic story, ultimately having some enterprise success is going to be pretty important. And as I mentioned earlier, the advent of these more batteries-included agent harnesses, like the Claude Agent SDK, LangChain Deep Agents, and the Codex SDK, are important things that make it a little more accessible and easier for people to build successful agents. So we will continue to try to grow our current core business and also attempt to be part of the enterprise story.

CRAIG:

Yeah. Is there anything I haven't asked that you want listeners to hear? And then I was going to close with a call to action: if people want to check this out, what kinds of people should, and how would they do so?

JONATHAN:

I think just go to our site, runloop.ai. You can read up on it and sign up; sign-up is free, and you can start using the system. I think really everyone who's curious about this should go play with agents. The other thing I would encourage people to do is choose any of the frameworks I've mentioned, LangChain Deep Agents or the Claude Agent SDK or Codex, and just go play with those canned agents locally, then think of something you want to make custom for yourself, build it, and give it a shot. It's not that hard. More now than ever, I don't think this genie is going back in the bottle, and I think it behooves everyone, even if you're not going to be someone who's writing agents all the time, to be aware and to have some touch-and-feel experience with them.

CRAIG:

Yeah, I agree entirely, and that's part of the purpose of this podcast. I used to have a tagline at the end, which I don't say anymore because it seems a little obvious. AI is changing your world, so pay attention. Yeah. Even if you don't think it's relevant to you, this is the world we're entering.

JONATHAN:

And it is. We started this company at the beginning of 2024, and at the time I pretty much wrote 100 percent of my own code. 2024 was a year where people started to notice that if you ask ChatGPT questions, it'll produce code you can copy-paste out, and maybe you had some ChatGPT-assisted writing. From there it went to Cursor, where it's just in your IDE, but it was still early: you still had to do a lot of stuff yourself, and the quality of the code produced was maybe not awesome. To now, where these agents are capable of writing, depending on the engineer and the problems you're solving, between 60 and 90 percent of your code. You have to direct and clean up and sometimes restart and say, look, do it this way, this is the right way to do it. But the change in 18 or so months is just stunning, and I expect that is ultimately going to happen in other disciplines and verticals.

CRAIG:

Yeah. And on coding, I always ask this because I'm not a coder. How long do you think before you have a product with a system of these very capable coding agents that can check each other's work? So you give the top-line instruction in natural language, maybe there's some back and forth for it to clarify areas it doesn't understand, and then it goes off and codes and comes back with clean, executable code without any human intervention.

JONATHAN:

Depending on the complexity of what you're doing, that already happens. If you wanted to build a reserve-a-soccer-field website with a database and a calendar, where people could put in their email and reserve a soccer field, and then host it on Vercel, it can do that right now: with a nice UX, with database persistence, maybe with something like OAuth where you sign in with an email account. It can do all of that on its own right now. So what you described is already possible. The real question is how quickly that frontier of complexity will expand.

CRAIG:

Yeah. Okay, Jonathan, this has been fascinating. I'm going to check it out, and I hope our listeners will too.

