Clarifai Founder and CEO Matt Zeiler will explore the shift in focus from training to inference, and how a unified compute orchestration layer is fundamentally changing the way teams run Large Language Models (LLMs) and agentic systems.

 
 
 

MATT:

It started with just inference back in 2014. The reason we started there, and with an API platform, was inspirations like Stripe and Twilio, where they built a great developer product and people just built amazing things on top of that, and they became kind of the engine for that. We saw that there needs to be this engine for AI, and that's the same product on the commercial side as the public sector side. We intentionally built a whole suite of UIs that make the system easy enough for anybody to use. Because of that, we've had even intelligence analysts train models themselves, doing all the labeling, picking a training template, looking at evaluation metrics of how they perform, without our teams being involved. That's key to making the APIs successful, which is ultimately what you're going to use. Once you have a good model, you're going to use the APIs to fire through lots and lots of data for inference. But to get to that model, the UIs are really important.

CRAIG:

In business, they say you can have better, cheaper, or faster, but you only get to pick two. What if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have, since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high-availability, consistently high-performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. Better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is a cloud built for AI and all your biggest workloads. Right now, with zero commitment, try OCI for free. Head to Oracle.com slash EyeonAI. EyeonAI all run together, E Y E O N A I. That's Oracle.com slash EyeonAI.

MATT:

I'm Matt Zeiler, founder and CEO of Clarifai. My history in AI goes way back, even into undergrad. That's kind of where it started, a little bit by luck. I was at the University of Toronto, in a program where you have to decide among many different options of engineering, and I was deciding between the computer option and nanotechnology. That's when I happened to run into one of Geoff Hinton's PhD students, a guy named Graham Taylor, who happened to be my resident advisor on the floor I was living on. So that was the luck of it all. He showed me some of his research, and back then he was generating realistic-looking videos of a flame flickering, and he said it was all done by AI. Fast forward to today, everybody calls that generative AI. This was 2007. So we've been doing this stuff for a long, long time, well before ChatGPT, when everybody kind of realized its value. That got me hooked, because I had learned how to program before that, and there was no way I could program, in a traditional sense, to generate a video like that. So I decided to dive into the computer option, took Geoff Hinton's course, and ended up being Geoff Hinton's first undergrad thesis student ever. That was quite the honor. I did a thesis on generating motion capture data for biological studies of pigeons, how they walk around and behave. So that was a cool undergrad, and by that point I knew I had to focus on AI. It was going to change a lot of things in the world, and I didn't know enough coming out of undergrad, so I decided to do a PhD. I looked at all the different schools and chose NYU to study with people like Rob Fergus, who was my PhD advisor, and others like Yann LeCun in his lab. I got to work with both of them and focused on computer vision for my PhD, applying neural networks to understanding images and videos. That was between 2009 and 2013. During that time, I also spent a couple of summers at Google Brain and worked under Jeff Dean on the Brain project. At that time it was like 20 or 30 people, so it was really fun to work there, and that's where I learned how to scale up AI at the scale Google operates at. But I left my second internship a couple of weeks early to come back to New York and start building a company called Clarifai, because I realized the work I was doing in my PhD was actually working better than what Google had internally for understanding images. I saw that as an opportunity to start a company around it. It had been a childhood dream of mine to start a company; since undergrad I knew it would be in the AI space, and this was my shot at it, having really good research. I incorporated in November 2013, and about three weeks after incorporating, I ended up winning ImageNet under the Clarifai name. That was the biggest computer vision competition at the time, and it was the year immediately after AlexNet won ImageNet, which put deep learning on the map as an effective method in the first place. So it was great to be the first in the world to reproduce those results and then surpass them. And it was a great way to kick off the company, because it got inbound interest from day one from investors and customers.
So that's kind of the origin story for Clarifai, and I can go into more about what we do, but I'll pause there.

CRAIG:

Yeah. And on that ImageNet competition, was it you and a team that you'd pulled together for the startup, or were you still working through academia?

MATT:

Good question. I actually did two entries. One was with my PhD advisor, Rob Fergus. We call that ZFNet, or I think other people named it that, for Zeiler-Fergus, and it got third place in the competition. Somebody else got second, and then Clarifai, which was only me at the time, got first place. There were a few different techniques I was using as part of the company, separate from my work at NYU.

CRAIG:

Yeah. Ah, that's amazing. And from there, Clarifai very quickly grew into one of the first and one of the most influential image recognition companies, and you began working with law enforcement and the Pentagon on Project Maven. Can you talk about that? Because I remember at that time, around 2017-2018, when I really started focusing on AI, I wanted to talk to you guys, and you were not available to talk. I don't remember exactly why. I think there had been some sensitivity around the law enforcement applications or something, and you guys just didn't want to be out in the media. Can you talk about that period?

MATT:

Sure. Yeah, so between starting the company at the end of 2013 and that period, we built a lot of products around computer vision, really a horizontal platform. We had our first APIs up in 2014 serving inference traffic. We started by building what people call foundation models today for computer vision, trained to recognize tens of thousands of things in the world, so you don't have to train them yourself. We built a few of those for different use cases, like wedding photos and food. Our general-purpose one is still the most popular to this date: social media, consumer photos, all that kind of stuff. Content moderation, filtering drugs, nudity, weapons, is another big, popular use case. Then in 2016, we opened up the platform to let people label data and train their own models, so they could fine-tune it for their particular use cases. Shortly after that is when the government approached us to be part of Project Maven, because they wanted to see if they could take a platform like that and fine-tune it on intelligence data. They got a demo of our product. I still remember this; it was maybe the only customer meeting I've had on a weekend. The leaders of Project Maven came up to New York to our offices on a Sunday, carrying a military bag, walked in, and had me demo for an hour. After that, they said, yeah, you definitely need to be part of this program, and we were under contract within a month. It's still our fastest government contract since, actually. It was a great experience with the original leaders of Maven, because they took the approach an enterprise takes in the commercial world. They evaluate different alternatives, they do their research, they find the companies doing cutting-edge stuff, they talk to them, evaluate them, and then pick and get them under contract quickly so they can see the value from it. So that was our experience with Maven. And working with the military did raise a few questions internally. There were other companies, famously Google, that were part of Project Maven and had similar conversations happen internally. I still remember, probably in 2018, I had an all-hands where I told the company that what we're doing in projects with the government like Project Maven is ultimately going to save lives, using artificial intelligence to make decisions faster and in a more intelligent way than humans can, and that because of that, we were going to be focused on this as a company. Some people didn't like that, and one or two quit after that all-hands. But what it did was stop all the gossip, stop all the debating internally, and get us all aligned. A lot of people after that all-hands, many more than the number that quit, said, thanks for doing that, finally we're past this, I'm really excited to be working on these kinds of programs, and I totally agree it's going to have a big impact. So that was a big catalyst to us continuing.

CRAIG:

And in Project Maven, a lot of people think of it just as vision-enabled drones, but it was much broader than that, right? Can you describe Project Maven and what happened to it? I've lost track of it.

MATT:

It's still going. I literally just drove over from a meeting with a customer right before this. I can't go into all the different things that it does, but it broadly became the first AI program within the government, and definitely the first successful one. Because of that success, it gathered other lines of effort, covering all the different types of sensor data you can imagine the military having. They want to apply AI to, again, accelerate those decisions and make better ones. So it started with drones, but it's gone in many different directions from there.

CRAIG:

Yeah, and is it visual data, or is it anything? Signals intelligence, or...

MATT:

I mean, it started with visual, but now there are discussions about everything you can imagine at this point. And that includes, with GenAI becoming very popular and useful in the last couple of years, large language models. What can you do with them in an intelligence scenario? That's part of the discussions with Maven, but also with many other programs across the DoD and intelligence community.

CRAIG:

Yeah. And the output from the Maven platforms, what's the primary use? I remember early on there was a lot of talk about targeting, but then there was a lot of pullback from that by the government, saying that it wasn't targeting, it's really for intelligence analysis. Where has that gone? Where does it lie now? Is it still with the Army, or with the JAIC, the Joint Artificial Intelligence Center? Is it a joint program across the services?

MATT:

No, it's actually transitioned to the NGA, so it's part of the intelligence community now, and that maybe hints at the purpose of it: less automating target recognition, more providing better intelligence.

CRAIG:

Yeah. And so that's ongoing. And then you also had products for law enforcement. Are those discrete products, or is it a platform? Does Clarifai have a platform or APIs that different agencies employ and cobble together a system on their side? Or do you have a product that the agencies, whether it's law enforcement or not, can use to scan faces, for example, or whatever the inputs are?

MATT:

Yeah, so everything is one platform. As I mentioned, it started with just inference back in 2014. The reason we started there, and with an API platform, was inspirations like Stripe and Twilio, where they built a great developer product and people just built amazing things on top of that, and they became kind of the engine for that. We saw that there needs to be this engine for AI, and that's the same product on the commercial side as the public sector side. We intentionally didn't want to build two different products as a small company, and in order to do that, we had to focus on making the overall platform really flexible in its deployment options. When you deploy with a lot of these government agencies, they typically want to go on-premise, even into air-gapped environments, on classified networks that aren't connected to the internet. So all these different parameters go into our builds now, so they're flexible enough to go into those environments all the way up to AWS, Google, Azure, et cetera. That really helped us build a much better product at the end of the day, handling all those things. But it also focused us on having just that one product that we take to solve a given problem. There's usually collaboration with the customer and our professional services team, who can customize things in the product and customize the AI models with the customer, and the platform's easy enough for the customer to do this themselves if they choose, so the AI learns what the customer cares about recognizing or generating. So it's flexible enough to deploy everywhere and flexible enough to cater to the problem at hand.

CRAIG:

Yeah, but it's a developer's product, an API into your system that they build into solutions.

MATT:

Yeah, exactly. And on top of the API, we have really simple-to-use user interfaces. For example, data labeling is a core component of building AI systems, and that requires a lot of user interfaces to draw bounding boxes or draw per-pixel masks of objects. Then there are UIs for the task management: what do you want to label? Who is the workforce you want to label it with? Is it complete? What's the progress? All those kinds of experiences are not very good as an API only. So we intentionally built a whole suite of UIs that make the system easy enough for anybody to use. Because of that, we've had even intelligence analysts train models themselves, doing all the labeling, picking a training template, looking at evaluation metrics of how they perform, without our teams being involved. That's key to making the APIs successful, which is ultimately what you're going to use. Once you have a good model, you're going to use the APIs to fire through lots and lots of data for inference. But to get to that model, the UIs are really important.

CRAIG:

Yeah. And then you sort of shifted into inference as a service, is that right?

MATT:

I would argue we started there, and now we're doubling down on it again.

CRAIG:

Yeah, can you talk about that? I've had Rodrigo Liang and Andrew Feldman on, that's the SambaNova and Cerebras side. I haven't had the guy from Groq on; we missed appointments a few times. But that's on the hardware side, speeding inference through hardware, and you're doing it through optimization in the code. Is that right? So you can beat these inference chips, these specialized accelerators, without any specialized hardware, simply by optimizing the code? First of all, why is inference such a bottleneck? Because it feels fast to consumers. And then, how did you go about doing this? Would your inference strategies accelerate inference on a platform like Cerebras that much more? Or is it a fork in the road: either you go down code optimization, algorithm optimization, or you go down the hardware route, but you can't have both?

MATT:

Yeah, great question. So inference is a good segue from the last conversation. It's kind of the production use case. Training the model is a precursor to the inference. Really, what you want is an intelligent system that you're going to run over and over, get accurate results, get them quickly or at large scale over batches of data, and do that in a repeatable, reliable way. That's really where the company started back in 2014 with our first models. We didn't let anybody train in the platform, didn't let them label or anything; it was just inference. And back then, even going into my PhD, we were some of the first people writing CUDA kernels for AI. I started getting into GPU stuff definitely by 2012, maybe even 2011. At that time, it really started out of Stanford; there was some work from Andrew Ng's lab, and then Toronto picked it up with Geoff Hinton's lab and people like Alex Krizhevsky writing some really good kernels. There were only four labs doing this back then: those two, NYU, where I was doing my PhD, and Montreal with Yoshua Bengio's lab. All four of us were collaborating on the use of GPUs, and once Alex had some really good kernels, we all shared them. I remember distinctly in my PhD, I first adopted them for my experiments, and overnight they were 30 times faster. That just changed the game. Instead of waiting for results for a day, I could go for lunch, come back, and have my experiment done. Thinking back, I was probably one of the first 20 people in the world actually writing CUDA kernels for AI, which was pretty cool. That kind of deep expertise was the genesis for Clarifai. We had to build toolkits before PyTorch and TensorFlow existed. We had to build GPU operators in Kubernetes before NVIDIA was building operators to support GPUs within Kubernetes. So, all the layers up, we have expertise within Clarifai on how to build these systems. And the realization in the last year and a half was that everybody's still struggling to run these big large language models that have become very popular since ChatGPT. So why don't we take a lot of the learning we have, and a lot of the internal tools we've built over the years to run models efficiently and at very large scale, and open them up as a new product line? That's what we launched recently: AI compute orchestration. We call it compute orchestration, not just inference, because inference is just the first workload. You'll see more coming for training and for agentic tasks that run in the background for a long period of time, all coming soon. But we view inference as the most important one, because of the scale it operates at, and because a lot of the optimizations we can do benefit inference really, really well. Your second question was whether those optimizations work only on GPUs or can work on other things. They can work on other things, depending on the optimization. We have a mix of low-level things like CUDA kernels that we optimize. We have things like moving Python code to C. There's a variety of different tweaks we do, but there are also tweaks that work across accelerators. For example, we're usually predicting many tokens in advance using good speculative techniques, and that should work on any kind of accelerator: a Google TPU, a Cerebras wafer, whatever you may throw at it.
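To make the idea of "predicting many tokens in advance" concrete, here is a minimal Python sketch of speculative decoding. It is not Clarifai's implementation; the draft_next and target_accepts helpers are hypothetical stand-ins for a small draft model and a large verifying model.

```python
# Minimal sketch of speculative decoding (illustrative only, not Clarifai's code).
# Assumptions: `draft_next` cheaply proposes the next token, and `target_accepts`
# runs the large model once over the drafted block, returning how many drafted
# tokens it agrees with plus its own next token. Both are hypothetical stand-ins.

def speculative_decode(prompt_tokens, draft_next, target_accepts,
                       block_size=4, max_new_tokens=64):
    """Draft several tokens with a small model, then verify the whole block
    with the large model in a single pass."""
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new_tokens:
        # 1. The small/cheap model drafts a short block of candidate tokens.
        draft = []
        for _ in range(block_size):
            draft.append(draft_next(out + draft))

        # 2. The large model verifies the block in one forward pass.
        n_accepted, correction = target_accepts(out, draft)

        # 3. Keep the accepted prefix; the large model's token replaces the
        #    first rejected one, so every iteration emits at least one token.
        out.extend(draft[:n_accepted])
        out.append(correction)
    return out
```

The win is that one large-model pass can validate several drafted tokens at once, so effective throughput scales with the draft model's acceptance rate rather than with one expensive pass per token, regardless of which accelerator runs the big model.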
So it would be a mix. And we've found it depends a lot on the use case you're running these models for, how well some of these accelerations work. This year especially, there's been a lot of excitement about reasoning models and agentic tasks. That's really what we built the optimizations in the Clarifai Reasoning Engine for, because if you think about how these models reason, they're actually repeating themselves many times, and they're referencing a lot of the material in your large context many times. So there are certain optimizations you can do for those kinds of repeatable agentic tasks that perform really, really well, but they won't work as well on every general task.
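One common family of optimizations for agentic workloads that keep referencing the same long context is caching the computed state of a shared prompt prefix so it is not recomputed on every call. The toy sketch below illustrates the idea only; the encode_prefix and decode_with_state helpers are hypothetical, and this is not a description of the Clarifai Reasoning Engine's actual mechanism.

```python
# Toy illustration of prefix (KV) caching for repeated agentic calls.
# `encode_prefix` stands in for the expensive prefill over a long, shared
# context; `decode_with_state` stands in for generation given that cached state.
import hashlib

_prefix_cache = {}  # maps a hash of the shared context to its precomputed state

def cached_generate(shared_context, task_prompt, encode_prefix, decode_with_state):
    key = hashlib.sha256(shared_context.encode()).hexdigest()
    if key not in _prefix_cache:
        # Expensive step: run the model over the long shared context once.
        _prefix_cache[key] = encode_prefix(shared_context)
    state = _prefix_cache[key]
    # Cheap step: only the short, task-specific suffix is processed per call.
    return decode_with_state(state, task_prompt)
```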

CRAIG:

Yeah. And to unpack some of that: you know, I get lost in these conversations, and I have so many of them. I'm familiar with the jargon, and I don't want to slow guests down by asking them to explain things the more advanced listeners already know. But let's explain to people what a CUDA kernel is and why it's important. Go ahead.

MATT:

Yeah, so CUDA is NVIDIA's programming language for their GPUs. I don't actually remember what it stands for off the top of my head, but it's been around for almost 15 to 20 years at this point. There used to be languages for gamers to program GPUs to make games, but those were really low-level, like shader languages. CUDA became a general-purpose language: just like you can program any software on a CPU, which we've been doing for many decades, this was the equivalent for a GPU. And GPUs just unlock much more parallelism than a CPU offers, because they have thousands of cores instead of tens of cores. That gives a lot of advantages for neural networks in particular, because the matrix multiplications that are common in a neural network are highly parallel. So GPUs are a great fit.

CRAIG:

Yeah. And then a kernel is a function?

MATT:

Yeah, so an example of a kernel is a matrix multiplication kernel. Multiplying one matrix by another would be executed as a single kernel on a GPU. When we're talking about CUDA, we're talking about NVIDIA. AMD now has an equivalent of CUDA called ROCm, so there are ROCm kernels and CUDA kernels, and that's kind of the lowest level. It builds up from there. A lot of people today don't have to operate at that level, because NVIDIA, AMD, Cerebras, Groq, whoever's making the hardware, usually make really optimized kernels for their hardware so that people can just use the software on top, like PyTorch, for example, which uses the best kernels for NVIDIA GPUs, AMD GPUs, and other hardware providers. But when you're trying to push the limits of optimization, sometimes you can find techniques at the kernel programming level that are even better than what the hardware providers wrote themselves.
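For a concrete picture of what a matrix multiplication kernel looks like, here is a naive one written in Python with Numba's CUDA support. It is a teaching sketch only; production kernels of the kind discussed here (cuBLAS, the ones shipped with PyTorch, or hand-tuned vendor kernels) add tiling, shared memory, and tensor-core paths.

```python
# Naive matrix-multiplication CUDA kernel via Numba (illustrative sketch only).
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(A, B, C):
    # Each GPU thread computes one output element C[i, j].
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

def gpu_matmul(A, B):
    C = cuda.device_array((A.shape[0], B.shape[1]), dtype=np.float32)
    threads = (16, 16)  # threads per block
    blocks = ((C.shape[0] + 15) // 16, (C.shape[1] + 15) // 16)
    matmul_kernel[blocks, threads](cuda.to_device(A), cuda.to_device(B), C)
    return C.copy_to_host()
```

The key point of the example is the parallelism Matt describes: thousands of threads each compute one output element at the same time, which is why matrix-heavy neural networks map so well onto GPUs.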

CRAIG:

Yeah, and this is essentially what DeepSeek did, right?

MATT:

DeepSeek did a lot of really clever engineering, down to the kernels and all the way throughout the systems engineering, everything.

CRAIG:

And how much of, so, the hardware producers like Groq, SambaNova, Cerebras, who are focusing on inference, their point is that you don't need to go in and tinker with the kernels in whatever the programming language of that particular hardware is. They've done that, and they provide super-fast inference that you can access through an API into their cloud, or on-premise if you want to do that. In the DeepSeek model, it's not the hardware, you could presumably use any hardware; they're optimizing inference on the software side. How influenced were you by DeepSeek? Was DeepSeek just the first to make a splash while the zeitgeist was already working in that direction? Or did people see what DeepSeek did and say, yeah, that's right, we can do that, why aren't we doing that?

MATT:

Well, we were working on compute orchestration about a year before DeepSeek came out, just because the general trend was that open source is catching up to the proprietary models. I think DeepSeek was just a huge catalyst to open everybody's eyes in that same direction. They did a lot of great work on the DeepSeek team, and they published a lot of it too: open source GitHub repos, good technical reports. So kudos to them. And they're not the only ones; there's been a lot of great work out of the Alibaba teams on the Qwen models, and the Kimi K2 models are another good one. A lot of great open source models are coming out of China. The one we've focused on the most is OpenAI's GPT-OSS. The reason for that is, one, it's built by an American company, and we do a lot of work with the public sector, so we can deploy that model on-premise without any question marks. But even more important, it's a really efficient model from a compute standpoint. You can run it on a single GPU and get really high throughput, so you can offer it at really competitive pricing, and it's very intelligent. That combo of cheap, fast, and intelligent is the trifecta, I guess, that everybody looks for, and I don't think anything else is even close on those dimensions. Some models may be more intelligent, but then they need eight GPUs, and that costs a lot. So it's a great model, and we're the fastest in the world at running it because of all the focus we've put on it.

CRAIG:

Yeah. And so you're offering it as a service? I mean, it's not open source; the code is not open source.

MATT:

Correct.

CRAIG:

Yeah.

MATT:

Yeah. So the way our compute orchestration works is we have some of these popular models, like GPT-OSS, offered as a serverless option, so you can pay per token. But, and this is what most customers end up doing, if you like a model, or maybe you have your own fine-tuned version of it or your own custom foundation model, and you want to run with some of the optimizations we have, you can deploy into what we call dedicated node pools. These are pools of the same type of instance that can scale up and down dynamically based on your traffic needs, and they're dedicated to you. You don't have other customers making requests in there, and you're in full control to deploy whatever you want into them and control how things scale up and down. So we have both the serverless, token-based pricing and the dedicated option, which is priced by GPU hour, essentially. Lots of flexibility in how we offer up these kinds of models.
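For a sense of what the pay-per-token serverless side looks like from a developer's seat, here is a hedged sketch using the OpenAI-compatible client pattern that many inference providers expose. The base URL, environment variable, and model identifier below are placeholders for illustration, not confirmed Clarifai endpoints or names.

```python
# Hedged sketch: calling a serverless, pay-per-token endpoint through an
# OpenAI-compatible client, a pattern many inference providers support.
# The base_url, env var, and model name below are placeholders, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-provider.com/v1",  # placeholder endpoint
    api_key=os.environ["INFERENCE_API_KEY"],                # placeholder env var
)

response = client.chat.completions.create(
    model="gpt-oss-120b",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize the Project Maven discussion."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The dedicated node pool option described above changes where and how the model runs and how it is billed (by GPU hour), but the calling pattern on the client side can stay essentially the same.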

CRAIG:

And that dedicated GPU pool, is that on your cloud? Do you offer a cloud, or is it in any cloud, and you have ways to ring-fence the GPUs, or...

MATT:

So we actually offer both now. In our VPCs, you can provision Clarifai compute, and those sit in our overall cloud VPCs that span multiple clouds. You can also deploy it to your own VPC or even your own bare-metal Kubernetes clusters. So wherever your compute may live, you can run. The big advantage is that flexibility. Maybe you start on-premise, because that's going to be the cheapest if you have the upfront capital to buy machines, but you're by definition at a fixed size when you're on-premise. So you need a way to spill over to maybe a neocloud as the next option. And even neoclouds have limited capacity; if you're really taking off, you may have to spill over to a hyperscaler or multiple hyperscalers. You can do all of that within the Clarifai platform without having to retool or re-engineer things differently for every cloud or for on-premise. You get the same kind of performance on chips wherever they may be, and that includes not just NVIDIA chips but AMD, and we're working toward support for Google TPUs as we speak, so there will be other accelerators offered in the near future as well.

CRAIG:

Yeah, and you were saying this can run on any hardware, so it could run on SambaNova or Cerebras. But does it accelerate their inference beyond the speed they offer? I mean, is there a cumulative effect, or is it going to be what you guys can manage, or what Cerebras can manage on its own, not some multiple of that?

MATT:

So those platforms we haven't focused on yet. I'd love to explore that with each of them. But in our experience with new chips, there's a long lead time for the software they provide to catch up, and especially for other third parties to implement optimized software. I think Google TPUs are at that point now, but they've been in the works for about a decade, and other platforms, I think, are really early in how well the software stack will let third parties write good stuff. CUDA's been at it for a long time; AMD's caught up with ROCm. But on the others, I think it would be hard for somebody like us to squeeze performance out. And I think that's actually why you see people like Cerebras, who used to sell the wafers a lot, that was their business, now focused on serverless: we'll just host the models and be really fast at it. And they are, they're great. But I think they found it difficult for people to write the software for the hardware they were selling. Yeah, exactly.

CRAIG:

And so where does this go? How does this change your business? Does it just extend your business?

MATT:

Yeah, I think it accelerates it. We're still named leaders in computer vision to this day; Forrester named us a leader last year in their computer vision platform report, ahead of everybody, like Google and AWS. So we still wear that crown, which is awesome to see. But with computer vision, we find the use case is always very specific, and you need some of those professional services to train for a given customer. A good example: we just signed a deal last quarter with an NFL football analytics company. They want to use our platform to do all the sports analytics they're doing, but they need our PhDs to help them train the best possible models to do that analysis very accurately. Whereas with large language models, it's kind of the opposite end of the spectrum: the one-size-fits-all model is so general purpose that it can be prompted to do so much. The game there, the focus there, is that if you can build really performant, really reliable systems, like our Clarifai Reasoning Engine is doing now, it's about how quickly and cost-effectively you can offer it to customers. Once we launched our benchmark results on Artificial Analysis, showing us as leaders on the GPT-OSS model, we closed an inference deal in five days. I don't think we've ever closed a computer vision deal in five days. So if that's a sign of what's to come in a repeatable way, we're very excited about that new offering.

CRAIG:

Yeah. You mentioned token throughput. Is that the key to inference speed?

MATT:

There are a couple of important dimensions, and throughput, the number of tokens per second, is one of them. The other is usually time to first token. And there are different measures of that one: there's time to first token, period, and there's time to first answer token. With these reasoning models, they do a lot of thinking before they respond with a final answer. So time to first answer token combines both time to first token and throughput, because to get through all the reasoning quickly you need high throughput, and time to first token is when the reasoning starts. We're doing really well at both, and that ultimately gives you the time to first answer, the final token of the answer, which is really fast. We've had some customers who, after seeing these benchmarks, said, let me try this myself. They compared us against other popular inference providers out there who focus on similar things, kernel optimizations, and who tout themselves as being fastest at inference. The customers came back and said we were 65% lower on time to first token and 11% faster on output tokens per second, and that resulted in 40% faster time to answer. They signed up within five days. So we're really excited to see the real-world performance, because it's going to depend, again, on your prompt, your data, and the model you choose to run. There are a lot of variables in overall performance, so we urge every customer to test it out and work with us to see if we can push it even further.
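The metrics named here can all be measured from the client side of a streaming API. The sketch below shows one plausible way to do it; stream_tokens stands in for any callable that yields generated tokens as they arrive, and the answer_marker tag separating reasoning from the final answer is model-specific and hypothetical.

```python
# Sketch of measuring time to first token, time to first answer token, and
# output throughput from a streaming response. `stream_tokens` is any callable
# that yields generated tokens one at a time; the reasoning/answer marker is a
# hypothetical, model-specific convention.
import time

def measure_latency(stream_tokens, answer_marker="</think>"):
    start = time.perf_counter()
    ttft = None                 # time to first token (reasoning or answer)
    tt_first_answer = None      # time to first token of the final answer
    in_answer = False
    n_tokens = 0
    for tok in stream_tokens():
        now = time.perf_counter()
        n_tokens += 1
        if ttft is None:
            ttft = now - start
        if not in_answer and answer_marker in tok:
            in_answer = True            # reasoning finished; answer starts next
        elif in_answer and tt_first_answer is None:
            tt_first_answer = now - start
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": ttft,
        "time_to_first_answer_token_s": tt_first_answer,  # None if no reasoning phase
        "output_tokens_per_s": n_tokens / total if total > 0 else 0.0,
    }
```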

CRAIG:

Yeah, and again, for the lay listener, a token is a chunk of data that's roughly part of a word, and throughput is how quickly the model can process those tokens. And that's a matter of computation speed, right?

MATT:

That's right, yeah. When you're talking about text, a token is usually three to four characters, sometimes bigger, sometimes smaller, depending on the token. Some of these models have a vocabulary of around 200,000 tokens, so that's the set of options the model chooses from every time it takes one generative step. And in terms of our optimizations with GPT-OSS, we're talking about doing anywhere from 500 to 1,000 of those steps per second; that many tokens are being generated per second. So that gives you a sense that every little microsecond, when we look at our profiles, actually matters, because if you have to do something a thousand times every second, a millisecond is all you have.
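A quick back-of-the-envelope calculation makes the per-token time budget concrete; the numbers below simply plug in the figures mentioned above (roughly 3-4 characters per token, a ~200,000-token vocabulary, 500 to 1,000 generation steps per second).

```python
# Back-of-the-envelope arithmetic for the figures cited above.
tokens_per_second = 500          # lower end of the 500-1,000 range mentioned
chars_per_token = 3.5            # roughly 3-4 characters per text token
vocab_size = 200_000             # approximate vocabulary size mentioned

ms_per_token = 1000 / tokens_per_second
print(f"~{ms_per_token:.1f} ms budget per generated token")          # ~2.0 ms
print(f"~{tokens_per_second * chars_per_token:.0f} characters/second")  # ~1750
print(f"each step picks 1 of {vocab_size:,} possible tokens")
```

At the top of the quoted range, 1,000 steps per second, the budget drops to about one millisecond per token, which is the point being made about microseconds mattering.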

CRAIG:

Yeah. And does that affect how well, because context windows, the number of tokens you can provide a model as input, have been growing dramatically, but models get kind of lost when the context window becomes too large. Does the speed of token throughput improve the model's ability to handle large context windows? Or is it the same problem, just seen faster?

MATT:

Right, it would still have the same problem. That's more of an architecture choice and a training choice, how to handle what they call the forgetting problem. But some of the optimizations we've done have actually helped a lot on long-context speed. We improved long-context processing, 100,000 tokens or more, by 5x from where we started. The default implementations you get from some of the open source models and the open source toolkits that run them are pretty inefficient in those long-context scenarios. But if the model is still going to forget, that's an accuracy problem; we're optimizing performance.

CRAIG:

I've just been talking to a lot of people about post-transformer architectures to deal with that problem, in particular state space models like Mamba. I don't know if you know the Manifest people in New York; they're coming out with some interesting models. Anyway, those state space models, do you work with those, and with these post-transformer iterations that attempt to deal with the context window problem and forgetting and all of that?

MATT:

Yeah, I mean, GPT-OSS even interleaves different types of attention mechanisms. Attention is usually the operation that becomes really slow with long context, so a lot of the focus is there with things like Mamba and others, but GPT-OSS interleaves different types. Some of the Qwen Next models that came out recently are doing what they call interleaved hybrid attention. So yeah, we're familiar with a lot of those, and they help a lot. I think that's why GPT-OSS in particular is so fast. I haven't personally dived into these state space models, but I'm optimistic that the Transformer is not the final architecture.

CRAIG:

And the reason I'm interested in this: if you can handle massive context without the model getting lost or forgetting or missing information, then you don't really need a vector database or these other external knowledge bases. You can put it all in the context and have the model do inference on the information in that context. Do you see that as where this is going, or do you think there are other architectures that will solve those problems?

MATT:

I do think it's where it's going, but it will probably create the need for different architectures along the way. I think there's a growing need for use cases like: I want a personal assistant who knows everything about me, every conversation, every meeting, every tool I've used. That's a lot of data, and it never stops growing. So that's the direction we're moving towards, and it's going to necessitate a lot of both engineering and novel AI architectures.

CRAIG:

Yeah. And what are some of the use cases you've seen for this super-fast inference, where inference speed really makes a difference? Of course, everybody wants things to go faster, but what applications do you see this being critical for?

MATT:

So time to first token, and really the time to answer, matter especially for these real-time chat experiences. We've all experienced it: whether you're using ChatGPT, Gemini, Claude, whatever your favorite is, the longer it takes, the less you want to use it. I've actually moved to Gemini Flash because it's very fast, and very consistently fast; Google does a really good job of serving these models. So the response speed matters. Every day I'm in the middle of something, stuck in code, and I want to ask, is this possible or whatever. I jump over to Gemini, get an immediate answer, and I'm back at my task. If you have to wait even a minute, you're distracted, you're context switching in your brain, and it's going to slow you down. So latency is important for those kinds of use cases. Then there are the agentic workloads. I remember having a conversation with our investors a couple of years ago, about six months after ChatGPT came out, where I said all the future stuff is not going to be in a chat window; it's going to be agents working in the background. And this year I'm happy to see the shift, especially in coding tools, to agents that work in the background. We're using a lot of GitHub Copilot internally to do that, and I think their experience is great. I never took the plunge into things like Cursor, where I have to actually change my editor to use their product. I just want the AI to do the work while I stay in the editor I'm familiar with. In those scenarios, you're chewing through tokens; the agents do a lot of thinking and a lot of iteration, but they're asynchronous. Throughput is the important measure there. I don't care about time to first token, because I'm kicking that off asynchronously, and it's going to send me a notification in 20 minutes when it's done. So I think those are good use cases for a lot of the token throughput optimizations we've done.

CRAIG:

So Clarifai has launched this as a service, and I presume one of the reasons we're talking is that it's growing fast. What's the response been, and how crowded is the inference market now? Since everything seems to have shifted from training to inference.

MATT:

Yeah, the response has been awesome, especially the validation that customers, as they test it out, are giving back to us. And we're going to be available soon in many different places. On the coding agent side, we're integrating into tools like OpenHands, which is an open source agent for coding. We have a pull request out for the Vercel AI SDK. We're already in LangChain, LlamaIndex, popular ones like that that you're used to, and CrewAI. And we're going to be on some of these open routers as well, like OpenRouter, where they go across all these inference providers and rank them by performance and price. So we're expecting a big increase in volume because of all these initiatives to basically be available where the developers are, and we're excited for that. It is a crowded space, I would say, totally agree on that. But when people look at vendors, they look at whether you have a track record of doing this, and we have the longest track record of anybody, having done this since 2014. We were even two or three years ahead of any of the hyperscalers with APIs for AI. They look at that and build a lot of trust based on it. Then they look at the performance we're getting in these public benchmarks, and that excites them as well, because they can get it in a cost-effective way. And finally, they look at the flexibility of being able to run in our cloud, in your cloud, or on-premise, and you get to choose; you're in full control. None of the other inference providers are offering that. So we like our position. We're really excited about how this latest product has come together, and now that it's out there, getting that customer feedback to validate it is great.

CRAIG:

Increasingly, inference is moving to the edge. A lot of this happens on device, and now we've got self-driving vehicles hitting the roads, where you can't be relying on the cloud for inference. Do you guys work at the edge?

MATT:

Yeah, and we've even done work on that in some of these government contracts. NVIDIA keeps changing the name; it used to be Jetson, Tegra, and then Orin, and now there's something else, I think. But yeah, those low-power NVIDIA GPUs are great, and we can do all the processing locally within them. Edge also means a few different things, we've found, as we talk to different types of customers. One way to classify it is those kinds of low-power devices. Another is, for example, telecoms. We're working with a few right now that have a huge number of data centers to power the internet and your cell phones, and they're realizing that could be a big advantage in serving AI traffic if they put good compute in those data centers. We're talking about thousands of them, versus AWS, which has maybe five or so regions in the US, so it's a couple of orders of magnitude difference. They can be very local to where you are, which enables really low-latency, streaming-style applications: maybe you want video analysis of an operating room in a hospital with alerts on it, or even driving a robotic arm automatically using AI. For those types of applications, you need lots of processing and you need it to be really low latency. So that's another definition of edge we hear. It's not a low-powered 10-watt device, but it's really close to where it's needed.

CRAIG:

Right now, so little of all the inference in the world is taking place at the edge. Do you think that will shift as robotics become more versatile?

MATT:

Absolutely, yeah. A couple of thoughts there. It would be interesting to know how much inference is already being done on Android and iOS devices. I know iOS does a lot; I have an iPhone, and I use the dictation functionality all the time, and it works pretty well. It still doesn't get Kubernetes, which is the most frustrating thing, but one day AI will catch up on the word Kubernetes. It always spells Kubernetes like it's a person's name or something. But at least it's happening on the device. So there's lots of inference already happening at the edge on mobile devices, and that will continue. Every time Apple or Google does an announcement, they're pushing the limits of how much capability is in the processors on these devices, and Qualcomm's doing a lot in the AI space now, too. So I think we'll continue to see a lot more in mobile, and then, to your point, I think the next big wave in AI is going to be the physical world, with robotics. I think it's five to ten years out before we'll be amazed by robots. Even the most advanced ones we see out of Tesla are not that amazing right now. I'm sure they're going to get great, but I don't think it's the AI, actually; I think it's the physical capabilities, the actuators and all that kind of stuff. They have a long way to go to replicate what a human is capable of. So I'm excited for that, but it will drive a lot of compute towards the edge as well.

CRAIG:

Is there anything I haven't asked about that you'd like people to know, about what Clarifai is doing or what you think is going to happen in the future? Whether we're all going to regret this AI moment?

MATT:

No, I think we covered a lot. We're very excited about the new offering. A lot of people remember us as computer vision, because we started there and continue to lead there, but we do so much more in the GenAI space today. So I'm eager to have more people try it out and give us good feedback, and we'll continue to make things even better.

CRAIG:

In business, they say you can have better, cheaper, or faster, but you only get to pick two. What if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have, since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high-availability, consistently high-performance environment and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. Better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is a cloud built for AI and all your biggest workloads. Right now, with zero commitment, try OCI for free. Head to Oracle.com slash EyeonAI. EyeonAI all run together, E Y E O N A I. That's Oracle.com slash EyeonAI.
