Leverage
AI Memory

Yash Gaitonde, Memory Engineer @ Cursor


In this interview, Yash, the sole engineer prototyping the "Memories" feature at the multi-billion dollar IDE company Cursor, provides a look into their development process. He details the technical challenges and user experience learnings from building a memory system that enhances the agent's context without hijacking the user's control.

Insights

  • A bad memory is worse than none: The most frustrating user experience is an agent that confidently "doubles down" on incorrect information it has remembered.
  • Models have their own definition of "memory": Anthropic's models, like Claude 4 Sonnet, are good at reflection but tend to save task-specific logs rather than the generalizable knowledge a coding agent needs.
  • Over 90% of conversation is noise: The vast majority of an interaction with a coding agent is specific to the immediate task and should not be stored in long-term memory.
  • Users aren't the best memory curators: What users believe the agent should remember is not always what is most useful for improving future performance.
  • New models excel at reflection: Modern LLMs can understand a user's natural-language corrections to update or delete a faulty memory without complex tooling.
  • Evaluation focuses on generation quality: Cursor found it more practical to evaluate whether the agent was saving high-quality, non-task-specific memories rather than creating contrived retrieval tests.
  • Memory is kept in the background: To avoid user anxiety about curating a "perfect" memory bank, generation is mostly hidden, with options to edit or delete memories when the agent cites them.
  • The North Star is team-wide intelligence: The ultimate goal is for an agent to learn from one user's interactions and share that knowledge, preventing teammates' agents from making the same mistakes.

The Big 3

1. The Philosophy of Agent Memory

  • Cursor breaks down "context" into three categories: Directional (to narrow search), Operational (runbook-style rules), and Behavioral (the agent's personality); a rough sketch of these categories follows this list.
  • The primary goal of the Memories feature is to automatically augment these context types, reducing the need for manually written rules.
  • The project's guiding principle is to be conservative and prioritize user control, as an agent that stops listening or goes "off the rails" is the biggest point of friction.
  • An incorrect memory is a critical failure because the agent will stubbornly refuse to follow new instructions, believing its memory is the source of truth.
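
To make the three categories concrete, here is one way they could be represented in code. This is a hypothetical sketch, not Cursor's actual data model; the class, field, and example values are invented for illustration.

```python
# Hypothetical sketch of the three context categories; names are illustrative,
# not Cursor's actual data model.
from dataclasses import dataclass
from enum import Enum


class ContextKind(Enum):
    DIRECTIONAL = "directional"   # narrows the agent's initial codebase search
    OPERATIONAL = "operational"   # runbook-style rules: deploys, conventions
    BEHAVIORAL = "behavioral"     # how the agent should behave with this user


@dataclass
class ContextItem:
    kind: ContextKind
    content: str   # e.g. "Integration tests live under tests/e2e/"
    source: str    # "cursor_rule", "user_rule", or "memory"


def context_section(items: list[ContextItem], kind: ContextKind) -> str:
    """Collect all context items of one kind into a block of prompt text."""
    return "\n".join(item.content for item in items if item.kind == kind)
```

In this framing, the Memories feature is simply another `source` that can populate any of the three kinds without a human writing a rule by hand.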

2. Prototyping Competing Architectures

  • Two main approaches were tested in parallel: a sidecar approach using a separate, smaller model to observe conversations and a tool-call approach where the main agent could directly update its own memory.
  • While powerful models like Claude 4 Opus made the tool-call method viable, they had a bias towards creating "task logs," which was the opposite of Cursor's goal of generalizable knowledge.
  • The sidecar model required precise and aggressive prompting to teach it to ignore the 90%+ of conversational content that was task-specific and not worth remembering long-term.
  • The final implementation is a hybrid: the sidecar model generates memories in the background, while the main model's reflection capabilities are used to let users correct faulty memories with natural language (see the sketch after this list).
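
Below is a minimal sketch of that hybrid: a background sidecar model that proposes only generalizable memories, plus a tool the main agent can call when the user corrects a faulty memory in natural language. Everything here is an assumption for illustration; `call_small_model`, `memory_store`, and the tool schema are hypothetical placeholders, not Cursor internals or any specific provider's API.

```python
# Hedged sketch of the hybrid approach: a background "sidecar" model proposes
# memories, and the main agent gets a tool to update/delete them when the user
# pushes back. `call_small_model` and `memory_store` are hypothetical
# placeholders, not Cursor internals or a specific provider API.

SIDECAR_PROMPT = (
    "You observe a coding-agent conversation. Save ONLY durable, generalizable "
    "facts about the user or their codebase (preferences, conventions, how to "
    "run things). Ignore anything specific to the current task. If nothing "
    "qualifies, return an empty list."
)


def sidecar_pass(conversation: str, call_small_model, memory_store) -> None:
    """Background pass: let a small model extract candidate memories, if any."""
    candidates = call_small_model(system=SIDECAR_PROMPT, user=conversation)
    for memory in candidates:  # expected: a (usually empty) list of short strings
        memory_store.add(memory)


# Tool schema the main agent could be given so a natural-language correction
# ("no, we use pnpm here") results in the faulty memory being fixed or removed.
UPDATE_MEMORY_TOOL = {
    "name": "update_memory",
    "description": "Update or delete an existing memory the user has corrected.",
    "parameters": {
        "type": "object",
        "properties": {
            "memory_id": {"type": "string"},
            "new_content": {"type": "string", "description": "Empty string deletes the memory."},
        },
        "required": ["memory_id"],
    },
}
```

The key design choice mirrored here is that generation stays out of the main thread (the sidecar), while correction stays in it (the tool), so a bad memory can be fixed mid-conversation without any extra machinery.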

3. Evaluation -> User Experience

  • Evaluating memory's impact is difficult; its absence is more noticeable to a user than its presence.
  • Because of this, the team focused on measuring the quality of the memories being generated, specifically the system's ability to filter out useless information (see the sketch after this list).
  • A key UX learning was that models are robust enough to handle "noisy" memory banks with some junk data, and users' desire for a perfect memory log was counterproductive.
  • The future roadmap is to expand memory beyond user preferences to include deep knowledge of a specific codebase and, eventually, to share these learnings team-wide.
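
In the spirit of that evaluation approach (judging generation quality rather than contriving retrieval tests), here is a hedged sketch of a simple scorer that measures how often the system saves generalizable rather than task-specific memories. The `judge` callable and the yes/no rubric are assumptions for illustration, not Cursor's actual eval harness.

```python
# Rough sketch of a generation-quality eval: score how well the sidecar avoids
# saving task-specific memories. `judge` is a placeholder for an LLM-as-judge
# call; the rubric is an assumption, not Cursor's eval setup.

def memory_quality_score(generated_memories: list[str], judge) -> float:
    """Fraction of generated memories judged generalizable (not task-specific)."""
    if not generated_memories:
        return 1.0  # saving nothing is better than saving junk
    kept = 0
    for memory in generated_memories:
        verdict = judge(
            "Is this a durable, reusable fact about the user or codebase, "
            "rather than a note about one specific task? Answer yes or no.\n\n"
            + memory
        )
        if verdict.strip().lower().startswith("yes"):
            kept += 1
    return kept / len(generated_memories)
```

A score near 1.0 means the sidecar is ignoring most task-specific chatter, which matches the observation that over 90% of a coding-agent conversation is not worth remembering.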

Transcript

00:00:12 Yash Gaitonde: Hey, guys. Cool. Yes. So my name is Yash. I work at Cursor on the engineering team. And I'll apologize in advance because we don't have anyone who does slides, so I had to just kinda put these together on my own. So, you know, bear with the very poor design. But anyways, so I'm gonna talk a little bit about how we approach context generally. Wait. Yash, you're missing one key point. Yes. And I have been working on memories, by myself at Cursor for the past few months, just prototyping, and we're kind of making our way out of the woods. So gonna talk about that.

00:00:30 Greg Kamradt: Okay. So Yash isn't bragging for himself as much as he should, so I asked him: how many people are working on memory at Cursor? And he goes, I'm the only one. So we're talking about a $9,000,000,000 IDE here where memory is a load-bearing process for it, and Yash is the one that's working on it. So very excited to see this.

00:00:49 Yash Gaitonde: Thank you so much. Thank you. But yeah. So I think, like, very similar to, what other folks have touched upon. When you try and look at context as this big kind of word, I think it gets really really confusing and muddy. And so we kind of tried to break it down into three different categories for our agent. And obviously, this is very much an art and not a science, so and it's very product dependent as well.

00:01:25 Yash Gaitonde: But for us, we kind of have three different types. So you have directional, and the idea there is, when you present a high-level task to the agent, usually, like, it will kind of begin its search wide, and directional context will allow it to narrow that search earlier and earlier and kind of get to the relevant set of files as quickly as it can. The next thing is operational, and that's kind of runbook related. So how do I deploy a service? How do I make edits in this particular file? What are the conventions? And so we have already created Cursor rules for that, which are written by a human, but the model will basically fetch them when they're relevant to context and use that to guide its edits.

00:02:04 Yash Gaitonde: And the last thing is a little bit more fuzzy. It's kind of similar to the holistic theory of mind that Sam mentioned earlier. And that's kind of like a behavioral context where you want the model to act in a certain way when it's going back and forth with you. And so we had this concept of user rules where you could specify, okay, please only speak to me in Spanish or something like that. But to be honest, most of our effort so far has been just focused on the first category of codebase search. And recently, we've kind of been branching out, which is what I've been working on. And so the goal of memories is basically to augment all three of these types of context, with the idea that, you know, you can rely less and less on Cursor rules, less and less on user rules, and maybe even less and less on codebase search once we've learned more and more about how you interact with the agent.

00:02:51 Yash Gaitonde: Okay. And so, like most things at Cursor, we start with prototypes. So I've been prototyping memories for about a month and a half now, and our general principle was to start pretty conservative. And the reason for that is because Cursor is a coding agent where the user generally likes to feel in control. And one of the most frustrating things about using Cursor is when it kind of goes off the rails and stops listening, which, I'm sure if any of you have used it, you must have experienced it before. And so one worry and one thing that we saw a lot during prototyping is if the model gets an incorrect memory about your codebase or about how you like to run the terminal or anything really.

00:03:29 Yash Gaitonde: It's a really frustrating experience, because the model will refuse to do certain things or will do things incorrectly. And then when you try and tell it, like, hey, this isn't the right way, it will actually, like, double down because it's like, oh, no. I have a memory. Like, this has gotta be correct. And so I'll get to that part later, but basically, when we started prototyping, we began with kind of two parallel approaches that I tried out. So one we called, like, the sidecar approach, and this is where the model doesn't call any tools to generate memory, but rather, in the background, as you interact with the agent, we kind of pull relevant parts of context into a smaller model, and that smaller model makes a decision on what to save, if anything at all, and also what to update in the existing knowledge that it has.

00:04:14 Yash Gaitonde: And then the second approach we tried was the tool call approach, where you just give the model a tool called update memory, and the model, as it detects you interacting with the agent, will decide on its own, you know, this memory is incorrect, I'm gonna update it, or I'm gonna delete it, or I'm gonna add something new. So in the sidecar approach, you basically have a small model listening to your conversation, and at some points, you basically send the conversation off to that small model. And the small model makes the decision completely independent of the main thread.

00:04:51 Yash Gaitonde: And then, versus that, the tool call approach is all in the main thread. The sidecar doesn't necessarily have to be without tools. Like, you can choose to give it tools if you want to, but you could also not. We've tried both. So yeah. Also, feel free to interrupt me with questions at any point. No. No. No. That was great. I'm sure other people had questions too. And so, yeah, I'll start with tool call memories. This was definitely the simplest thing to implement. And so you just have the model kind of reflect and decide when the user has expressed something to it that it determines is, like, worth remembering.

00:05:27 Yash Gaitonde: And in that case, it will create a memory itself. And so what's interesting is we noticed, like, as I was prototyping over the past few months, model capabilities have improved in such a way that this is, like, a legit approach where you can just give the tool to the model and it works. And so, specifically, Sonnet 4 and Opus 4 as well are really good at instruction following, and they've also been RL'd specifically on Anthropic's definition of memories. But the tricky thing is Anthropic actually has a different definition of memories than what we have. And it kind of goes to the point that everyone has made: you need to decide what memories are for your product.

00:06:02 Yash Gaitonde: You can't just think of it in the abstract. And what Sonnet wants to do with memories is kind of create a task log. So you can see that in, like, their Pokemon example, where the agent kind of keeps a log of all the things that it's tried to do. And so when we just gave Sonnet 4 a memory tool, it would try and save, like, task-specific memories, which in coding agents, like, at least for us, is kind of the opposite of what you want, because you don't wanna remember the things that were specific to one particular conversation. You wanna remember the things that are generalizable and will be useful in a future, completely unrelated generation.

00:06:35 Yash Gaitonde: And to that end, we've now, like, kind of tried to keep Sonnet 4 and Opus 4 from generating new memories, but they are still really good at reflection. So, if the model has an incorrect memory and the user just, in natural language, kind of expresses disagreement, then they're really good at just updating their memory themselves without having to do anything fancy in the background. And the second approach is the sidecar, which I was talking about. And so, I spent weeks iterating here on the prompting. It was like a whole roller coaster up and down, where I kind of lost hope and got hope back and kind of settled in somewhere where I'm happy.

00:07:12 Yash Gaitonde: And the biggest thing that I was fighting is this concept of, like, a task-specific memory, because in a coding agent interaction, probably 90-plus percent of what you're saying to the agent is not worth remembering. It's very specific to the task that you've given the agent. And so, probably in most conversations, there isn't even anything worth remembering for this particular type of memory. It depends on what you wanna remember. And so I tried a bunch of different approaches to basically keep the model from focusing too much on task-specific things. And similarly, like with the tool calls, it kind of changed as the models got better.

00:07:47 Yash Gaitonde: So, you know, back when Claude 3.7 was the latest model, I was kind of prompting it super aggressively by giving it a bunch of examples, and it sort of worked, but it also wasn't super great, and a lot of things kind of slipped through the cracks. And with the newer set of reasoning models, essentially, you can just kind of give them, like, a very brief description of the problem. You have to lay it out very precisely, like every word will matter, but you don't actually need that much text. You don't need that many examples, and they'll do a pretty good job at the task. So eventually, like, we didn't end up with a super complicated system for the sidecar model.

00:08:26 Yash Gaitonde: It was very simple. And the last thing was evaluation. So, evaluating memories in our experience has been really tough, because it's the type of feature that you notice when it's taken away, not when it's necessarily there. And so, you can try and think of evals where, like, the memories would help you get to the solution faster, but in some sense, it's kind of cheating, because you come up with the examples such that the memories are useful. And so when we evaluated the memory sidecar model, like, I focused it basically on the quality of the memory generated, not on the retrieval side. And specifically, mostly just filtering out these task-specific memories.

00:09:04 Yash Gaitonde: And so we ended up in a place where we're pretty happy with the quality of memories that were getting generated. And then the next big question was kind of UX, which is, how much do you wanna expose your memory bank to your users? And I think one big learning has been that what users think the model should remember is not necessarily what's useful. And especially, like, users get really frantic and think that their memory bank has to be perfect, which is, like, so far from the truth, because in reality, even if you give a model a memory bank where, like, 50% of the memories are junk and 50% of them are applicable, the models are smart enough now to kinda filter out the noise largely.

00:09:41 Yash Gaitonde: And so that was, like, a really interesting thing. And so as a result, we've kind of kept the generation a bit hidden. And, like, obviously, you can go in and change it if you don't like it, but most of the editing of memories happens, like, when a model will choose to cite a memory in its generation. And then you can kind of hover over and delete. But on the actual generation side, we don't really surface it. Cool. But yeah. So that's kind of what I've been talking about, but what's next? So, so far, we can understand the user, but we wanna learn a lot more about your codebase. And in particular, like, we want to have an understanding of the things that happen in your codebase outside of just the code, because, like, you know, we can churn through and embed your entire codebase and probably get a decent representation of just what's happening, but there's a lot of things that you do with your codebase that, you know, you express in the sidebar, like, when you're chatting with the agent, but they don't live anywhere.

00:10:29 Yash Gaitonde: And so we're trying to figure that out, basically. Gonna be a lot more prototyping and just, like, you know, shooting in the dark, but I'm confident we'll get there. And then with that kind of new set of memories, where they can't just be applied all the time in context, we need new approaches to including memories in context. And so there are some things we've been experimenting with there, changing the way that our search pipeline works to include memories, but it's all just prototypes so far. And then kind of the north star we're shooting for is team-wide knowledge, where, if someone can learn from the mistakes that I made with the agent, and their agent won't make that same mistake, like, that's the North Star.

00:11:09 Yash Gaitonde: But it's also, like, really, really important to get it right, because if you start sharing bad memories team-wide, it's like, it's a huge degradation of quality. So this is what we're working towards, but, you know, we still got some ways to go. But, yeah, I had a Q&A slide, but I think we're gonna save that for now. Thank you, guys.
