Jacob Buckman is a co-founder of Manifest AI, a research lab based in New York City. In this episode, Jacob dives deep into their breakthrough, Power Retention: an open-source architecture designed to finally solve the quadratic scaling problem of long context windows that currently plagues Transformers.

 
 
 

JACOB: Mamba and state-space models in general are what I would consider retention models: models that have this duality where they can be expressed either as recurrent models or as attention models. And once you have a model that can be expressed in both of these forms, it also unlocks a secret third form, which people call the chunked formulation, where you basically combine all the best parts of recurrence and attention to get an architecture that has tractable growth in cost with respect to context, linear growth in cost as you increase the context, but is also able to really saturate the GPU. I think that many tasks right now are better suited for in-context learning, but because current models have either very short context or a limited architectural ability to actually use what's in their context, it doesn't seem that most people are able to get enough value out of pure in-context learning, and so they have to essentially resort to fine-tuning as a way of getting that information into the model.

CRAIG: Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. Now an open source Linux Foundation project, AGNTCY is building the Internet of Agents, a collaborative layer where AI agents can discover, connect, and work across any framework. All the pieces engineers need to deploy multi-agent systems now belong to everyone who builds on AGNTCY, including robust identity and access management that ensures every agent is authenticated and trusted before interacting. AGNTCY also provides open, standardized tools for agent discovery, seamless protocols for agent-to-agent communication, and modular components for scalable workflows. Collaborate with developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build next-generation AI infrastructure together. AGNTCY is dropping code, specs, and services, no strings attached. Visit agntcy.org to contribute. That's A-G-N-T-C-Y.

JACOB: Hey, I'm Jacob. I studied computer science at Carnegie Mellon and then went on to do a PhD focused on AI at Mila in Montreal, and I'm one of the founders of Manifest AI. We're an independent AI research lab based in New York City, focused on pushing the frontiers of the current paradigm, understanding that the approach that has taken us this far is not going to supply all the pieces needed to get us to the AGI future that everyone is hoping for, and looking at the whole field from first principles to make the fundamental breakthroughs needed to get to that point.

CRAIG: Just a quick question: how are you funded, given that you're a lab without a product at the moment?

JACOB: We ultimately do plan to release products. Once we've made these breakthroughs, we're going to build products around the technology that we develop. We're backed by Decibel and True, two highly technical VCs based in San Francisco. And now that we've made a lot of progress on the architecture side, we're looking forward to starting to build our first products around this technology.

CRAIG: So we're going to talk about your new architecture.

CRAIG: That goes beyond transformer models and solves, in particular, the scaling problem with transformers: as you scale the context window, the cost scales quadratically, because every token has to attend to every other token, and as you scale that, it becomes untenable, or just needs a tremendous amount of compute. Your concept of power retention, as you were saying, is a combination of power attention and recurrence, not necessarily recurrent neural networks. Was that the concept you started Manifest with, or is it something that developed once you had started the lab?

JACOB: No, the specific architecture is something that we developed only after really understanding what the fundamentals of the problem were. We started off with just a vision for the future of the field, and an understanding of the first breakthroughs that would be needed, at minimum, to get there. Ultimately the angle was based around the specific technical requirements to solve the problem, but that was downstream; the process of doing research is about understanding and discovering these things. The insight that we started Manifest with was about scaling laws, and in particular that certain scaling laws were well served by the current paradigm. The parameter scaling law is one example: as you increase the parameter count, the model's performance improves, and the cost to do both training and inference on the model grows linearly with respect to the parameters. Linear growth is very tractable. This is what has allowed us to push the scale this far. Yes, it's expensive, but you get good returns on the investment because the growth in cost is consistent. But there's a different dimension of scaling, the input size, where this is not the case. And right now with transformers, as you scale to larger and larger inputs, you end up paying, as you mentioned earlier, a quadratic cost in the size of the input.

JACOB: So if we're talking LLMs, language models, the input is a big prompt, right? What's called the context window. It's all the text that the model is ingesting and synthesizing in order to produce a response. And as you increase that context window, the cost to train these models grows quadratically instead of linearly. It's basically unique in this respect among the possible axes of scaling that one might consider. And if we think that the neural networks of the future are going to be able to process and synthesize data from very large inputs, then just some quick napkin math will make you realize it's completely intractable to be processing these inputs at a quadratic cost. Understanding this, and forming the first insights toward what a solution would look like, was where we were at when we started Manifest. Since then it's been about two and a half years, and it's been a journey of understanding, from both an empirical and a theoretical perspective, exactly where the quadratic cost comes from, why it's important, how it affects learning, how it affects training time, and how it interacts with the hardware. We're training these models on GPUs, so you need something that is very hardware-efficient as well. And power retention is the outcome of all of this research.
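
For a rough sense of the napkin math Jacob alludes to, here is a minimal Python sketch comparing how per-layer compute grows with context length for full attention versus a fixed-state retention-style layer. The layer shapes and constants below are made up for illustration; only the scaling shape matters.

```python
# Illustrative only: rough per-layer FLOP counts for a context of length T
# with model dimension d. Constants are simplified; the point is the scaling.

def attention_flops(T: int, d: int) -> float:
    # Self-attention compares every token with every other token: ~T^2 * d.
    return T**2 * d

def recurrent_flops(T: int, d: int, state_size: int) -> float:
    # A fixed-size-state (retention / RNN-style) layer does a constant amount
    # of work per token: ~T * d * state_size.
    return T * d * state_size

d, state = 4096, 4096          # hypothetical sizes
for T in (8_000, 128_000, 1_000_000):
    ratio = attention_flops(T, d) / recurrent_flops(T, d, state)
    print(f"T={T:>9,}: attention costs ~{ratio:,.0f}x the fixed-state layer")
```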

CRAIG: It's not dissimilar, then, in that you're dealing with a state space model, is that right? From Mamba, which I've done an episode on earlier. Did you start by looking at Mamba, or were state space models attractive to you for other reasons?

JACOB: Yeah, Mamba and state space models in general are what I would consider retention models. Mamba-1 less so, but Mamba-2 for sure: these are models that have this duality where they can be expressed either as recurrent models or as attention models. And once you have a model that can be expressed in both of these forms, it also unlocks a secret third form, which people call the chunked formulation, where you basically combine all the best parts of recurrence and attention to get an architecture that both has tractable growth in cost with respect to context, linear growth in cost as you increase the context, and is also able to really saturate the GPU, which is what attention is able to do, and which traditional recurrent networks like LSTMs from the previous generation of neural networks were unable to do. So that's the first piece of the puzzle, and this wasn't a piece that was done only by us. This was done by many groups in parallel, including the Mamba group: realizing that there exists this other family of recurrent models that have this attention-form duality that enables the really efficient implementation. But that's not enough. The issue that we found with Mamba, and actually with basically all other modern RNNs or modern retention models that we were looking at as we started this research, is that the state size is too small, and in particular the state-to-weight ratio is highly skewed in the direction of the weights.
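
To make the "secret third form" concrete, here is a minimal NumPy sketch of the duality for a plain, un-normalized linear-attention-style retention layer: the same computation written in attention form, recurrent form, and chunked form. Real architectures such as Mamba-2 or power retention add gating/decay, normalization, and the power expansion on top, so treat this as a toy illustration of the equivalence, not any particular model's layer.

```python
import numpy as np

def attention_form(Q, K, V):
    # Materialize the full T x T causal score matrix: O(T^2 * d).
    scores = np.tril(Q @ K.T)          # causal mask: token t only sees s <= t
    return scores @ V

def recurrent_form(Q, K, V):
    # Carry a fixed-size (d_k x d_v) state: O(T * d^2), constant memory.
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(K[t], V[t])      # fold the new token into the state
        out[t] = Q[t] @ S
    return out

def chunked_form(Q, K, V, chunk=16):
    # Big matmuls within each chunk (good GPU utilization) plus a small
    # recurrent state handoff between chunks (linear in T overall).
    S = np.zeros((K.shape[1], V.shape[1]))
    outs = []
    for start in range(0, Q.shape[0], chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        intra = np.tril(q @ k.T) @ v   # attention within the chunk
        inter = q @ S                  # contribution from all earlier chunks
        outs.append(intra + inter)
        S += k.T @ v                   # update the state with this chunk
    return np.concatenate(outs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
a, r, c = attention_form(Q, K, V), recurrent_form(Q, K, V), chunked_form(Q, K, V)
assert np.allclose(a, r) and np.allclose(a, c)   # all three forms agree
```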

JACOB: In other words, the state is too small for the weight size of the model. And this is not true for transformers. The state of a transformer is typically not thought of as a state in the sense of an RNN, but it's actually a very useful framing to understand that the state of a transformer is the KV cache. It's this record of all the tokens it has seen up until now. And if you consider the KV cache of a transformer to be the state, well, you realize that the cache is massive at long contexts. So transformers actually have a huge, huge state compared to RNNs, even modern RNNs. And this is the downfall of RNNs, because state size has scaling laws just like parameter size does. Having a larger state translates into better performance, and so what you end up seeing with a lot of these retention architectures, including Mamba, Gated DeltaNet, RWKV, and various other subquadratic architectures you might encounter, is that the speedups you get, which you do get at long context, come into play at exactly the point where performance relative to transformers takes a hit.
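
For a sense of scale, here is a back-of-the-envelope comparison of the two kinds of "state" being contrasted. The layer and head counts below are hypothetical, not any specific model's configuration.

```python
# Illustrative comparison: a transformer's KV cache grows with context length T,
# while a typical modern RNN / retention layer keeps a fixed-size state.

def kv_cache_elems(T, n_layers, n_heads, head_dim):
    # keys + values, per layer, per head, per token
    return 2 * T * n_layers * n_heads * head_dim

def rnn_state_elems(n_layers, state_per_layer):
    return n_layers * state_per_layer

T = 128_000
kv = kv_cache_elems(T, n_layers=32, n_heads=32, head_dim=128)   # hypothetical shapes
rnn = rnn_state_elems(n_layers=32, state_per_layer=16 * 4096)   # hypothetical shapes
print(f"KV cache:  {kv / 1e9:.1f}B elements at T={T:,}")
print(f"RNN state: {rnn / 1e6:.1f}M elements, independent of T")
```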

JACOB: In other words, by the time you're training on contexts long enough for them to be meaningfully fast, the performance, because the state is so small, is far worse than transformers anyway. So they end up only being compute-optimal over a very narrow range of flop budgets, very, very small flop budgets, and you basically never want to actually train an architecture like this at scale. What does power retention do differently? The retention part is generic to all state space models, including Mamba and whatever else. But the power part, that's the unique part. Power retention is a retention model that has an extra adjustable axis of scale: you can adjust the state size independently of the parameter count, and this allows us to exploit state scaling as an axis of scale, just like we've already been exploiting parameter scaling. And so we're able to construct models that are retention models, they are RNNs, but they have state sizes that are as large as we want them to be, and this allows us to basically pick the optimal state size always and have actual compute-optimal performance across a very, very large range of flop budgets.

CRAIG: Can we back up a little bit, for the benefit of listeners who are not familiar with a lot of these terms? First of all, state space models: my understanding is that they record the state of the model at any point in time. Tell me if I'm wrong, but the weights of all the nodes in a network, all of that is recorded as a state, and that state then goes through a transition as the model learns, or operates in inference. Is that right? Can you describe it a little more precisely?

JACOB: Yeah, absolutely. So I think separating the state from the weights is an important conceptual distinction here. There are two things that inform a model's decision when the model gives you back a response. One is whatever question you asked it. If you ask a model a question, that is the context: all the tokens of the question get turned into context, and that context gets summarized into a state. The state is just a list of numbers that summarizes all the context seen so far, and that state obviously impacts what the model's reply is going to be. Then, separately, there's a different source of information, which is basically the entire internet. The entire internet affects the response of the model via the training process. Training is essentially about compressing the knowledge on the internet into the weights, which are just a list of numbers describing all the knowledge the model has about everything. So it's these two things, the state, which summarizes the specific information in the question, and the weights, which capture all the knowledge the model has from the internet, that combine to produce the response. And state space models maintain a state: as they see more tokens, for example as your conversation continues and you go back and forth discussing some topic, they keep updating the state.

CRAIG: I see. And the state is the context side, not the model weight side. Okay. And then recurrence, can you do the same with recurrence?

JACOB: Yeah, they're the same. What I just described is exactly the concept of recurrence. Recurrence is just a function that you can apply over and over again, and that is exactly what I described with the state: it's a function whose input is the state plus the new token you're adding to the context, and whose output is the new state, the updated summary of the context. So yeah, state space models are essentially just a modern rebranding of recurrent neural networks.
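
A toy sketch of that definition of recurrence: one update function applied over and over, mapping (state, new token) to a new state. The `update` rule below is a placeholder for illustration, not what any real retention layer computes.

```python
def update(state, token_embedding):
    # Toy rule: fold the new token into a running summary of the context.
    return [s + t for s, t in zip(state, token_embedding)]

state = [0.0, 0.0, 0.0, 0.0]            # empty state before any context is seen
context = [[0.1, 0.0, 0.2, 0.0],         # one embedding per token of context
           [0.0, 0.3, 0.0, 0.1]]
for token in context:
    state = update(state, token)         # the same function, applied repeatedly
print(state)                             # the state now summarizes everything seen so far
```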

CRAIG: And the power in power retention, describe that again.

JACOB: The power in power retention refers to a somewhat esoteric mathematical operation called the symmetric power. It's basically a mathematical operation you can do to a list of numbers to turn it into a larger list of numbers with some exploitable symmetries. And this is the key insight: you can use this operation to make a state larger and larger without introducing any extra parameters. The details of how it's implemented, while very interesting, are not crucial to the larger picture. The important thing is that it's a lever, essentially a hyperparameter, that can be set to adjust the state size of a recurrent neural network. And what's crucial is that it's a very hardware-friendly one, so you can adjust it in a way that doesn't hurt the ability to exploit a GPU.
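
The identity underneath this can be sketched with plain tensor powers: raising the query-key dot product to a power p is the same as taking an ordinary dot product between expanded versions of q and k, and those expanded vectors are what the recurrent state stores. As I understand it, the real kernels work with the deduplicated symmetric power and fuse everything into the chunked computation; the NumPy check below only illustrates the underlying identity.

```python
import numpy as np

def tensor_power(x, p):
    # Degree-p tensor power: a d-dimensional vector becomes a d**p-dimensional one.
    out = np.array([1.0])
    for _ in range(p):
        out = np.kron(out, x)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
p = 3
lhs = (q @ k) ** p                                   # the "power" of the dot product
rhs = tensor_power(q, p) @ tensor_power(k, p)        # dot product of expanded vectors
assert np.isclose(lhs, rhs)
# The expanded vectors are what flow into the recurrent state, so the state grows
# from d to roughly d**p (or C(d+p-1, p) after deduplicating symmetric entries)
# while the number of learned parameters stays exactly the same.
```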

CRAIG: Yeah. And in fact, GPUs have been tooled to work well with state space models. Is that right?

JACOB: Yeah. One of the main reasons why state space models are the most powerful family of RNNs is that you're able to get really good performance from GPUs on state space models specifically. This is not true for all RNNs. For classic RNNs such as LSTM or GRU architectures, for some deep reasons, it's actually basically impossible to get really, really good performance on a GPU, just because of the way GPUs are constructed. They're good at just one thing: big matrix multiplications. That's not strictly true, but to a first-order approximation that's basically what they're good at. And there's no way to express an arbitrary RNN as a big matrix multiplication. But state space models, or, as I usually call them, retention models, are a special family of RNN that you can implement as a couple of really big matrix multiplications, and therefore get really good hardware utilization.

CRAIG: Through post-training and through inference, the model weights don't change, is that right? I mean, they're fixed by the training; you perform inference from them, but the knowledge of the model doesn't change through operation in a transformer model. In this model, do the weights change over time?

JACOB: No, it's the exact same setup. Conceptually, the weights are almost by definition the things that get updated during training, and the state is in some sense defined as the thing that gets updated as you show the model more tokens of context during inference. That's basically how these two concepts are partitioned. So the weights do not change during inference. There might be some other research that blurs the boundaries here, but by and large this is the setup.

CRAIG: So you've now built a model with this power retention architecture. What's the name of the model?

JACOB: We've currently released one model whose architecture is based on power retention, called PowerCoder. It's a coding assistant at the 3 billion parameter scale, so still fairly small. But what's really cool about it is that it's a proof of concept for an idea we call metamorphosis. You start off with a pre-trained transformer, and by doing just a little bit of retraining, what you end up with is a power retention model of equivalent performance. What this means is that you can actually get the benefits of power retention, such as fast linear-cost inference and cheap post-training on large contexts, while still getting the benefits of pre-trained transformer weights that, for example, large model providers release. So you're not forced to throw all of that very expensive, very useful weight training away; you can exploit it even when you're using power retention. The point of PowerCoder is as a proof of concept of this. It's a usable, good code model with very fast inference; you can download it from Hugging Face and play around with it right now. But what we're really hoping to do with this is showcase what it looks like to start from really solid pre-trained transformer weights and convert them into an equivalently performant, but much faster and more flexible, power retention variant.

CRAIG: That's interesting. When you say start with a transformer model with pre-trained weights, will you use one of the open source models, or do you build your own GPT and then work on it from there?

JACOB: No, we get to start with the open source models. So, for example, we could have started with Llama. We didn't in this case, but we could have started with, say, the Llama 70 billion weights. This is a good, solid transformer model: it has its architecture, it has its weights, and you can use it for inference. What you do then is take a reasonably small number of GPUs, not nothing, but on the order of dozens of GPUs rather than thousands, and spend maybe six hours retraining Llama into a power retention variant. And all that means is you start with the regular Llama weights and the regular Llama architecture, go into the Llama architecture code, remove the call to attention, and put in a call to power retention. Just that one-line swap, and then you start training it. What you'll see is that within just a couple of hours of retraining, performance gets back to where it originally was when we were just using the Llama model off the shelf. But now, of course, it's a power retention model. It's no longer a transformer.
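
In code, the retraining recipe described here might look something like the sketch below. Everything in it is hypothetical pseudocode: `PowerRetention`, `load_pretrained`, `retrain_all_weights`, and the `transformer_blocks` / `attn` attribute names are stand-ins for whatever the actual model code exposes, not Manifest AI's real API.

```python
import torch.nn as nn

def metamorphose(model: nn.Module, make_retention_layer) -> nn.Module:
    """Swap every attention module for a retention layer of matching shape."""
    for block in model.transformer_blocks:              # assumed attribute name
        old_attn = block.attn                           # assumed attribute name
        block.attn = make_retention_layer(old_attn.embed_dim, old_attn.num_heads)
    return model

# Hypothetical usage, mirroring the description in the interview:
# model = load_pretrained("some-open-weights-transformer")   # e.g. a Llama-class model
# model = metamorphose(model, PowerRetention)                # the one-line swap
# retrain_all_weights(model, data)                           # a few hours on dozens of GPUs
```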

CRAIG: And the training, what do you call that, post-training or fine-tuning?

JACOB: We typically call it retraining. Fine-tuning often has a connotation of only training some of the weights, and we find that this works best when you train all of the weights. So we just call it retraining.

CRAIG: Okay. But it doesn't affect the size of the model or the number of parameters.

JACOB: Correct. It's the exact same number of parameters.

CRAIG: Yeah. What I'm sort of leading to is, you're talking about how you're looking for new architectures to get us beyond transformers and toward the holy grail. And one of the blockers with transformers, aside from the scaling of the context window, is catastrophic forgetting: the model can't grow its knowledge beyond its original training. You can fine-tune, tweak a little bit, but if you try to do a full retraining, you lose a lot of the original knowledge. Does the power retention model have the same problem?

JACOB: Yeah, it doesn't address that aspect of things. That is just a property of the weight updates. Gradient descent is making local updates to the weights, and so if you make enough local updates to enough of the weights, you'll end up in a very different part of function space than where you started out. I think these days it's actually not such a problem. People have gotten very good at doing what they call LoRA-like, low-rank updates; this is what I was referring to earlier with fine-tuning. Essentially, people have gotten good at starting with a base model, a pre-trained model, and only updating a few of the weights, or only making small updates to the weights, in order to specialize the model on some particular task or data set, and in so doing they avoid destroying the knowledge in the rest of the network. They're just adding a little bit of specialization for the one particular task that they care the most about. Doing real, meaningful additional pre-training, adding a huge amount of extra knowledge to the network, still seems out of reach of standard techniques. You need to have the original data set still around, and as you do continued training, you need to keep training on the original data set to preserve that knowledge while mixing in the new data to obtain the new knowledge. But if you just try to train only on the new data, without also continuing to train on the original data distribution in parallel, you will typically forget a lot. That only happens, though, when you're doing a lot of training on all of the weights.

CRAIG: But in your model, if you can increase the context window, I don't know, indefinitely? I mean, provided you have enough compute, you can add knowledge in the context window for the model to use in inference, right?

JACOB: Yeah, exactly. I think that many tasks right now are better suited for in-context learning, which is the process you just described, than for fine-tuning, which is the process I described a moment ago. But because current models have either very short contexts or, because of their architecture, a limited ability to actually use what's in their context, it doesn't seem that most people are able to get enough value out of pure in-context learning, and so they have to resort to fine-tuning as a way of getting the information into the model. One of the strengths of power retention is that a model trained end to end with power retention on long context still has a tractable cost, and these models don't suffer from the problem of only part of the context being genuinely useful; they're able to get value out of the entirety of their context. So we think that when you start really seeing these power retention models trained at scale on various interesting data sets, we'll be able to get most of the value that we currently get from fine-tuning just from pure in-context learning. Just show the model some examples in the context at inference time of what you want it to do, and watch it learn on the fly how to do that correctly.

CRAIG: Yeah, that's amazing. And what are some of the applications that you see with this? I'm just thinking, people use RAG and all these different techniques, and pre-training, to get around hallucinations and have models make inferences based on proprietary data. But if you can put it all in the context window and then work with the model, you aren't faced with those engineering problems. Is that right?

JACOB: Yeah, I think it makes things a lot easier. RAG is essentially a technique for selecting the right tokens to put into the context. The reason you need to do RAG is that you have to make hard decisions about what information you actually want the model to look at and synthesize when making its decision. And the larger the context window gets, the less you're forced to rely on RAG, because the more you can just dump everything into the context and let the model decide for itself what information is most relevant to the decision it's about to make. Now, this doesn't mean it will necessarily be able to memorize everything, so there will probably still be a place for RAG or RAG-like methods going forward, but it will relieve a lot of the burden of having to design these essentially ad hoc external similarity functions or search heuristics in order to decide what to put into the context. One of the things I'm most excited about within this space, actually, is information retrieval agents. Rather than having a RAG heuristic decide what information gets shown to the agent, another way to handle really vast or private information sources that the model needs to look at before making a decision is to let the model use something like tool calls to decide for itself which pieces of information it wants to obtain.

JACOB: Now the training process becomes one that you could do with, for example, reinforcement learning: letting the agent learn, given a particular query, what are the right things to search; given what I found in my search, what is the right thing to search next; what are the next sources of information I should pursue in order to get a complete picture of the response I'm trying to give. And this is something that requires really, really long context lengths, because you want this entire trajectory to be remembered by the agent, instead of having to break it up into little pieces where the agent forgets what's going on in between. And I think once we move on to that paradigm of information retrieval, and we start asking questions of agents that can, end to end, represent their entire history and access vast troves of private information in order to come up with a good response, we'll really feel like we've stopped the issues with hallucination. They just won't come up anymore, because, like a human, the model will spend some time doing research before coming to a conclusion, and it will actually be able to remember everything it encounters during that whole research process before giving a response.
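
As a rough sketch of what such an information-retrieval agent loop could look like, here is a hypothetical outline. The `model` and `search` objects and their methods are invented stand-ins, and a real system would train this loop with reinforcement learning as described above.

```python
def research(question: str, model, search, max_steps: int = 10) -> str:
    # Keep every search result in one long context instead of relying on a
    # RAG heuristic to decide up front what the model gets to see.
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        action = model.decide_next_action("\n".join(context))   # hypothetical call
        if action.kind == "answer":
            return action.text
        results = search(action.query)          # tool call chosen by the model itself
        context.append(f"Searched: {action.query}\nFound: {results}")
    return model.answer("\n".join(context))     # hypothetical fallback call
```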

CRAIG: Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. Now an open source Linux Foundation project, AGNTCY is building the Internet of Agents, a collaborative layer where AI agents can discover, connect, and work across any framework. All the pieces engineers need to deploy multi-agent systems now belong to everyone who builds on AGNTCY, including robust identity and access management that ensures every agent is authenticated and trusted before interacting. AGNTCY also provides open, standardized tools for agent discovery, seamless protocols for agent-to-agent communication, and modular components for scalable workflows. Collaborate with developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build next-generation AI infrastructure together. AGNTCY is dropping code, specs, and services, no strings attached. Visit agntcy.org to contribute. That's A-G-N-T-C-Y. But what about the problem that the longer the context window, the less performant the model? The model gets lost in the context, or doesn't pay attention to some of the knowledge in the context. What about that issue? I don't know what the word is for it.

JACOB: This is something that we attribute mostly to that selfsame quadratic cost of long context, but indirectly. Let me explain. Given that model providers want to offer their users long-context models, and it's intractably expensive to train a true transformer on long contexts for any meaningful amount of time, what a lot of model providers do is come up with an in-between solution. Instead of using a true transformer, they use what's called a windowed transformer or a sparse transformer, something that only attends to a small subset of the context rather than actually attending to the entire context window. Oftentimes they'll do what's called hybrid sparse attention, where some layers are full attention, but the vast majority of the layers are just this local windowed attention or sparse attention. The result is that it's not true that the entire context is being processed via attention; the effective context size is actually much smaller. And because of these weird sparse attention schemes, where attention is heuristically distributed across the input context, there are going to be hot spots and dry patches throughout the context. That's part of the problem. And it's compounded by the fact that instead of training on long-context data, most of these models are trained mostly on short-context data for the vast majority of their training time. Again, this just comes down to the fact that training on long context is really, really expensive, so model providers do it as little as they can get away with. The result is a model that has some limited long-context ability, and of course it gets advertised as a model with context length X, because users are typically not running detailed evaluations, and they're not privy to the details of the training or the architecture that might make somebody raise an eyebrow as to whether this is actually a model that's going to be effective at long contexts.

JACOB: They just take the model provider's word for it. They say it's a million-token context; okay, it must be a million tokens of context. And then they try to use it on tasks that are a million tokens of useful context long, and are surprised when the model can't actually effectively process pieces of the context, when it's not able to use the whole thing as effectively as it's able to use the first, say, 32,000 tokens of context. And this is a little like a shell game, a bait and switch. I think it's ubiquitous; I'm not pointing fingers at any one person in particular, and it's also very understandable, but it is sort of a dirty secret of the industry that nobody's models are actually long-context transformers. The long-context models are something else. What's beautiful about power retention is that it will actually enable end-to-end training at long contexts of every layer of a neural network for the entire training duration. In the models that we've trained so far, we've never seen any degradation at any point in the context from power retention, but when we train similarly sized sparse attention or windowed attention or hybrid attention models, we do see degradation similar to what we see in large-scale models. So we're guessing that this difference is going to persist as we continue to scale things up, and once the largest models are power retention based, there won't be any downsides to using more tokens of context.
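
To illustrate the windowed-attention compromise being described, here is a small NumPy sketch of a sliding-window causal mask next to a full causal mask. Real hybrid models interleave a few full-attention layers among many windowed or sparse ones, so the exact pattern varies by model; this is only meant to show why most of a long context is never directly attended to by a windowed layer.

```python
import numpy as np

def causal_mask(T):
    # Full attention: token t can see every earlier token (and itself).
    return np.tril(np.ones((T, T), dtype=bool))

def windowed_mask(T, window):
    # Sliding-window attention: token t only sees the last `window` tokens.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

T, window = 8, 3
print(causal_mask(T).astype(int))
print(windowed_mask(T, window).astype(int))   # most of the lower triangle is zeroed out
```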

CRAIG: Mm. Wow. That has huge implications. I mean, you were talking about information retrieval and building up the context for inference, but could it also apply to, you know, there's been a lot of talk, as the reasoning models have developed, about doing original scientific research. If you have an essentially unlimited context, could you load all of, say, one domain's scientific knowledge and then ask the model to reason over that, or come up with novel hypotheses, or something like that?

JACOB: I believe so, and this is one of the reasons why at Manifest we were so focused on unlocking this problem of large input sizes, input size scaling. It's because we view many of the most critical and important applications of AI, including, of course, scientific research and making discoveries about the physical world, as being heavily reliant on synthesizing huge, huge amounts of information in order to reach a conclusion. Think of a human scientist: not only are they reading a vast literature, they're also thinking constantly about their research domain, engaging in conversations for hours and hours, for days, for weeks, for years, for an entire career. Sometimes it takes ten years of consistent work on a problem, thinking about it, talking about it, reading about it, reading what other people have done in similar fields, in order to reach an actual insightful conclusion. And that is just not going to be achievable by models whose context length is so constrained. Maybe they can take a little bit of information, the amount they can store in their context, think really hard about it, and say something interesting about that. But the context required to say something deep about the real world is simply too vast, in my opinion, to be expressed in a very short context to reason over. So we're going to need models that can genuinely synthesize massive, massive amounts of information before reaching a conclusion in order to unlock these high-value applications.

CRAIG: Yeah. And in terms of continual learning, I mean, the catastrophic forgetting, the fixed parameter size, does this have any implications for solving some of those issues? I know you said it doesn't with catastrophic forgetting, but I'm just thinking, if the context window can grow and evolve, in effect the context window becomes kind of another model, another knowledge base, that's talking back and forth with the inference model. Am I off base there?

JACOB: No, you're right on point. I think that people who are very concerned right now about things like catastrophic forgetting are concerned because, at the moment, there's only one good lever for injecting new knowledge into a model, and it's the weights. And the reason it's the weights is that the context is just not long enough to inject very much knowledge. You have a giant data set of valuable information; well, you're not going to be able to put it all into the context, and so all you can do is put it into the weights. Putting it into the weights means you're battling against catastrophic forgetting, because now the weight updates could destroy pre-existing knowledge in the model. So that seems like a big, pressing concern, but it's only a concern because we chose to put the new knowledge into the model via the weights. If we instead decided to put the new knowledge into the model via the state, via the context, there'd be no issue with the fact that there's catastrophic forgetting on the weights, because we wouldn't need to update the weights in this way. And that's what I believe the future direction should be: we focus on updating the knowledge of the model via the context. An analogy that a lot of people use when thinking about AI is to think of the weights like our brain, and weight updates like learning: experiences in the real world translating into new thoughts in our heads.

JACOB: And I actually think this is completely wrong. I think the proper analogy is that us living our lives, interacting with the world, and absorbing thoughts into our heads, that is state updates. The whole life that I have lived up till now has been one long sequence of context, and my brain, the electrical signals in my brain that control my thought patterns, is basically the state that I have arrived at as the result of all the context I've seen. Weight updates had nothing to do with it. So where do weight updates live? In my opinion, the proper analogy is evolution. The human genome has been shaped by many, many years of evolution, which controls how not just us and our ancestors, but even our primordial ancestors, see and interact with the world. And evolution, like gradient descent, is basically a process of local search. The outcome of that is that we have a damn good brain. And this brain, just like a trained neural network, is capable of implicitly storing a huge amount of knowledge, and most of that knowledge is focused on how to process incoming context to update the state to a correct new state. I think that is where the field is going to head. We're going to stop worrying about updating the weights with new information. We're going to say, okay, the weights are good, and we're going to think about how we can put the right new information into the state via the context.

CRAIG: Okay. But one thing I don't understand there. So the context can grow, and you can have agents adding to the context. The model performs inference based on its original parameters. But the new knowledge that's created isn't stored in the model; it's stored in the context.

JACOB: Stored in the state is how I would describe it.

CRAIG: Yeah, but the state is the context.

JACOB: The state is derived from the context. You start with the context, which is, for example, text, and you turn that text into the state. The state is a list of numbers, a vector of numbers, that's stored internally in the model. Now, there's actually a perfect analogy on the other side: you have a data set, for example all the text on the internet. That's text too, so the data set is the analogue of the context. And what do you do with this data set? Well, you turn it into weights. Weights are a list of numbers stored in the model that basically tell it how to behave. So the state is a list of numbers, just like the weights, and the context is some text, just like the data set.

CRAIG: Where is the state stored? I guess that's my struggle.

JACOB: It's stored in the same place as the weights. Just, you know, a file on disk, or in the RAM of a GPU. It's just a list of numbers that the model uses every time it makes a prediction.

CRAIG: And so, as that state evolves with new knowledge, the model will always have access to that new knowledge; it doesn't disappear the way it does now. Right now, if you load a bunch of stuff into the context window of a model and then ask for inference, it performs inference, but when you're done with the operation, the model is still where it was, and the context is still outside the model.

JACOB: Yeah, exactly. The reason why these interfaces are set up like this is that context lengths are very limited. Typically, when you're interacting with, say, an AI chatbot, you'll open a new chat, and this new chat will be a model whose weights are the pre-trained weights that you're working with and whose state is basically the empty state. Or usually the chatbot will come with a small prompt that says you are a helpful and friendly AI assistant, blah blah blah, right? That context is injected into the model, updating its state to some initial state, and then you start having a conversation. You say something, the model replies, you say something else, the model replies again, and with each response on both sides the model's state continues to be updated. So now, halfway through your conversation, you have the same weights as you started with, but a state that has been updated to reflect all the content of the conversation you've just had. And then finally the AI gives you the correct answer to your question, you're satisfied, and you end the chat. Now you start a new chat, and in this new chat you're back to square one on the state: it's thrown out the state of the old chat. But this is because current models are not designed to be able to constantly update their state at fixed cost. Current chatbots are based on transformers, and transformers, unlike the retention models that we develop, and also unlike Mamba and RNNs like LSTMs and so on, do not keep a fixed-size state. The transformer's state constantly gets more and more expensive as more of the conversation happens, and as a result these products are designed to force you to end the chat after a certain amount of time.

JACOB: So this is how people are used to interacting with AI chatbots, and basically what they're familiar with. And the dynamic this produces is one akin to a consultant. When you go to an AI chatbot, you're basically going to somebody who is smart but has no knowledge of you or your problem, or very, very limited knowledge of you, and no knowledge of the problem that you're about to ask them about. But what we want is a dynamic that's more like a butler: somebody who has known you your whole life, raised you from a boy, and understands all of your needs, everything you like, all of your dislikes, all the problems you've had in the past and what solutions you ended up coming to. Somebody who can understand and preempt you, and doesn't need to be re-given a whole bunch of context for every question you want to ask, because it's all just been retained. It's just been one long session. There's no reason to start a new chat with a new model; you wouldn't want to, because the model you've been talking to already knows so much about you, and is actually capable of using that information to better answer your future queries. This is something that just doesn't currently exist, and it is downstream of the fundamental design of the architectures powering the models that people currently use. Once these models switch over to the retention framework, the dynamic will completely change.
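
A hedged sketch of the product shape a fixed-size state makes possible: because the state does not grow with history length, it can simply be saved per user and reloaded on the next visit, so starting a new chat no longer has to mean starting from scratch. The file layout and the `model.step` call below are invented purely for illustration.

```python
import os
import numpy as np

STATE_DIM = 4096   # fixed size, regardless of how long the user's history is

def load_state(user_id: str) -> np.ndarray:
    path = f"states/{user_id}.npy"
    return np.load(path) if os.path.exists(path) else np.zeros(STATE_DIM)

def save_state(user_id: str, state: np.ndarray) -> None:
    os.makedirs("states", exist_ok=True)
    np.save(f"states/{user_id}.npy", state)

# Per request (hypothetical model interface):
# state = load_state(user_id)
# state, reply = model.step(state, new_message)   # update state, produce a reply
# save_state(user_id, state)
```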

CRAIG: And so continual learning is then possible.

JACOB: Continual learning in the sense of updating the state with each new piece of information. Yes.

CRAIG: Yeah, but updating the state is also updating its knowledge. And would that be a model that's updating its state while interacting with different users, or would it be an instance of a model that updates its state over its use with a particular user, or a particular team, or a particular enterprise?

JACOB: It's an interesting product design question. Off the top of my head, I would guess that the most generically useful modality is to have it be updated per user, so each user has one particular copy of the AI with a state that records all previous interactions with that user. However, I can think of applications for the other ones. For example, maybe at a company it would be useful to have one AI that is the company's AI, and every single person at the company talks to the same AI. This doesn't mean it can only reply to one person at a time, but it does mean that it's constantly absorbing knowledge from all the different employees, from all of its conversations, so it can use something it learned in a conversation with you to give me a good answer. And at an organization where everyone is driving toward the same goal, I think this could be a very powerful information-sharing tool, serving much the same role as a project manager. Everybody tells the project manager what they're working on and what their blockers are, and the project manager collects all this information and broadcasts it out to the people who need it. If we imagine a single AI whose context is constantly being updated by conversations with not just one person but everyone on the project, you can imagine that AI itself serving a similar role, coordinating the team toward the objective. So I think this is a big and very interesting space of product design that gets unlocked once you have the fundamental capabilities available.

CRAIG: Yeah. I mean, it sounds potentially revolutionary. Is that how you feel?

JACOB: Yeah, I'm very excited to have been able to make this much progress on what we feel is probably the most important project in the field. This is the breakthrough that needs to happen. We feel that we've made it, and I'm excited to see how the next couple of years of AI progress pan out as a result.

CRAIG: Yeah. And the idea of this evolving state that's accumulating new knowledge, that could be applied, I mean, the state could be a physical state, right? It could be applied to robotics.

JACOB: Yeah, absolutely. That's a domain that we've worked with less; once you start touching the physical world, things get complicated. But absolutely. Just as with the brain analogy I described a moment ago, I think a very promising way to have robots that can handle arbitrary or unexpected terrain is to maintain a single unified state for the entire continuity of the robot's existence. It can get used to its own limbs, right? Even if you manufacture multiple instances of the same robot, each one is going to be slightly different from the others, and each one, as it progresses across its lifetime, is going to accumulate wear and tear on its limbs. It's going to have particular idiosyncrasies that the other versions of the robot don't have, and, crucially, that it wasn't trained to have at the beginning of its lifetime. For these robots to be able to adapt to this damage, well, think of a human who injures their foot: they don't lose their ability to walk. They just walk with a limp, a bit more slowly, but they can still get around, right? And if it gets really bad, maybe they have to get a crutch, and now they have what amounts to a completely different walking pattern that they have to learn. But they can learn it, and they can learn it because they have their entire memory of a lifetime of walking experiences, their whole physical intuition up till now, as a starting point. Then they're able to learn and adapt on the fly to the new sensations they get as they keep moving around. And I think this is going to be very important for the robotic control applications of the future, to make them really work.

CRAIG: Yeah. And the inputs for the context and for the state, it doesn't have to be text? I mean, I'm really interested in world models, where the model is learning from direct experience. Well, not direct, but through video; eventually, if it's embodied, it would be direct. Could this apply to the power retention model?

JACOB: Yeah, the underlying tech is completely generic. The way that we get text into the model in the first place is via this thing called embedding: you basically turn each token into a vector of numbers. And you don't have to turn only tokens into vectors of numbers. You see this even with multimodal models today: you can also turn an image into a vector of numbers, a frame of a video, a snippet of audio. All of these things are just signals that can be converted into numbers and fed into the model. And once they're in the model, that's when the power retention layer gets applied. It's about combining different sources of information in a way that's both computationally efficient and very powerful, and this is going to apply generically across all modalities. I do think there's something very special about text, which is that, unlike basically every other modality, text is from human brain to human brain. It's a communicative medium instead of a natural medium, something that can be understood instead of just perceived. As a result, the signal-to-noise ratio in text is way higher than in basically any other modality. If you look at the pixels of an image, the vast majority of the information contained in those pixels is surface statistics, general signals, correlations between things in the environment. It's not dense knowledge in the way that a paragraph of text is. Bit for bit, text is just the best source of knowledge, and as a result it's the way we can get the most intelligent behavior on the lowest compute budget, because we're spending basically all of that compute budget extracting intelligence rather than parsing through shapes and colors and other less semantically rich features. So I do think that text is inevitably going to be a crucial component of any future model, but I don't think it's going to be limited to text. Not by a long shot.
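
A small PyTorch sketch of the point about modalities: each input type just needs its own embedding into vectors of a shared model dimension, after which the sequence-mixing layers are agnostic to where the vectors came from. The shapes, patch sizes, and vocabulary size below are arbitrary illustrations, not any particular model's configuration.

```python
import torch
import torch.nn as nn

d_model = 512
text_embed  = nn.Embedding(32_000, d_model)       # token id -> vector
image_embed = nn.Linear(16 * 16 * 3, d_model)     # flattened image patch -> vector
audio_embed = nn.Linear(400, d_model)             # audio frame -> vector

tokens  = text_embed(torch.randint(0, 32_000, (1, 12)))      # (batch, seq, d_model)
patches = image_embed(torch.randn(1, 64, 16 * 16 * 3))
frames  = audio_embed(torch.randn(1, 32, 400))

# One long multimodal context; retention (or attention) layers would consume
# this sequence of vectors without caring which modality each one came from.
sequence = torch.cat([tokens, patches, frames], dim=1)
print(sequence.shape)   # torch.Size([1, 108, 512])
```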

CRAIG: And so you've built a proof of concept, and you've published the paper, done a pre-print. When is the paper coming out? Is it going to be at NeurIPS or not?

JACOB: Not NeurIPS, but probably the conference after.

CRAIG: Okay. And what are you doing now? Are you building a larger model, or...

JACOB: We're continuing to do these retrainings at larger and larger scales, and we're also starting to work with partners in particular domains, looking to apply power retention to various data sets where we think it would be particularly well suited: data sets where long-context processing can add a lot of value.

CRAIG: And your hope is that this architecture will be taken up. Well, you're publishing it. This is going to be open source, is that right?

JACOB: It already is. You can just pip install retention and you'll have a whole toolkit of power retention. You'll also have our fast implementation of flash retention, and a couple of other goodies in there as well. It's all open source on GitHub, and the models and model weights are available on Hugging Face. We just want people to get these things in their hands, play with them, train big models, do fast inference, and really see how powerful it can be. We believe that no individual lab can do all the work of validating a new architecture; it's something that deserves to be done by the community.

CRAIG: And when did it go up on Hugging Face?

JACOB: A couple of weeks ago. Yeah, two weeks ago.

CRAIG: And has there been a strong reaction, or are people still trying to figure out what it is?

JACOB: Yeah, people seem to like it. They've been playing with it. Again, this is all still more on the proof-of-concept side, but it's something that people can have in their hands and see the power of. So we've gotten a good amount of traction so far with the initial launch, and I'm excited to see how things pick up over the next couple of months to a year as well.

CRAIG: Yeah. Well, I'm excited too, and I'm going to be following it closely. Is there anything I haven't asked, Jacob, that you think listeners should hear?

JACOB: No, I think you mostly covered it. I would encourage any listeners who have interesting long-context data sets to please reach out, especially if you have cool enterprise applications to go along with them, because we'd love to work with you, training models and seeing how power retention can really add value in lots of different domains.
