LLMs Can't Remember Anything. Mem0 Might Help. Or Not.

Dishant Sharma
Nov 29th, 2025
6 min read

A GitHub repo hit 29,000 stars in under a year. Raised $24 million in funding. API calls jumped from 35 million to 186 million in just two quarters. And then someone published a blog post saying the whole thing might be built on flawed benchmarks.

That's mem0.ai. A memory layer for AI applications that promises to solve one of the biggest problems in building with LLMs. And depending on who you ask, it's either the future of AI infrastructure or an example of how easy it is to manipulate benchmarks.

I spent the last week reading everything I could find about it. Reddit threads. GitHub issues. Research papers. Even the drama between competing companies. Here's what I found.

The problem everyone has

LLMs don't remember anything. Every time you start a new chat, you're talking to someone with amnesia. They need the full context every single time.

You know this if you've used ChatGPT. You have to remind it about your project. Your preferences. What you talked about yesterday. It's annoying.

But it's worse than annoying. It's expensive. Those reminders are tokens. Tokens cost money. If you're passing hundreds of thousands of tokens on every query, your bill gets wild fast.

One developer reported that memory optimization cut their token usage by 20-40%.
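
To make that concrete, here's a back-of-envelope sketch. Every number in it is an assumption picked for illustration, not real pricing or real traffic.

```python
# Rough daily cost of re-sending context on every query.
# All numbers below are illustrative assumptions, not real pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed $/1K input tokens
QUERIES_PER_DAY = 10_000            # assumed traffic

def daily_cost(tokens_per_query: int) -> float:
    return tokens_per_query / 1_000 * PRICE_PER_1K_INPUT_TOKENS * QUERIES_PER_DAY

print(f"Full history (50K tokens/query): ${daily_cost(50_000):,.2f}/day")
print(f"Trimmed memory (2K tokens/query): ${daily_cost(2_000):,.2f}/day")
```

Even if those assumptions are off by a lot, the gap is the point. Whatever you re-send on every query gets multiplied by your traffic.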

And even if you don't care about cost, there's the performance hit. More tokens mean more latency. Your users wait longer. And there's this thing called "lost in the middle" where the model just forgets stuff buried in a massive context window.

What mem0 actually does

Three lines of code. That's the pitch.

You add mem0 to your AI app and it handles memory for you. It extracts important stuff from conversations. Stores it. Retrieves it when needed. Updates it when facts change.
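
Roughly, the open-source Python client looks like this. Treat it as a sketch based on mem0's docs rather than gospel; method names and return shapes have shifted between versions.

```python
from mem0 import Memory  # pip install mem0ai

# Assumes an OPENAI_API_KEY in the environment for the default extraction model.
m = Memory()

# Extract and store facts from a conversation turn.
m.add("I'm allergic to penicillin and I prefer morning appointments",
      user_id="alice")

# Later, pull relevant facts back before answering a new question.
results = m.search("anything to avoid prescribing?", user_id="alice")

# The return shape differs across versions: sometimes a list,
# sometimes a dict with a "results" key.
hits = results["results"] if isinstance(results, dict) else results
for hit in hits:
    print(hit["memory"])
```

The "three lines" claim is about that happy path. The integration work lives in everything around it.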

Behind the scenes it uses a hybrid setup. Graph databases. Vector stores. Key-value stores. All working together to remember what matters and forget what doesn't.
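
If you self-host, you wire those backends up yourself. The config below is a hypothetical example; the provider names and keys follow the pattern mem0's docs describe, but check the schema for your version before copying anything.

```python
import os
from mem0 import Memory

# Hypothetical self-hosted config: a vector store for semantic recall,
# a graph store for relationships between facts, and an LLM for extraction.
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
    "graph_store": {
        "provider": "neo4j",
        "config": {
            "url": "bolt://localhost:7687",
            "username": "neo4j",
            "password": os.environ["NEO4J_PASSWORD"],  # assumed env var
        },
    },
    "llm": {"provider": "openai", "config": {"model": "gpt-4o-mini"}},
}

m = Memory.from_config(config)
```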

The examples sound good. A healthcare chatbot that remembers patient allergies. A customer support agent that recalls your last ticket. An AI assistant that knows you hate small talk and prefer bullet points.

But here's where it gets messy.

The benchmark war

In April 2025, mem0 published research claiming they were state-of-the-art. They said they beat everyone on a benchmark called LoCoMo. Including a competitor called Zep.

The number they threw out was a 26% improvement.

Zep saw this and basically said "wait, what?" They published their own analysis showing that when you run the test correctly, Zep actually outperforms mem0 by 10%. Maybe even 24% depending on the setup.

The whole LoCoMo benchmark might be flawed.

Zep said mem0 implemented their system wrong for the test. Used sequential searches instead of concurrent ones. Compared apples to oranges.
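
If "sequential versus concurrent" sounds like a nitpick, it isn't. Here's the shape of the difference; `search_memory` is just a stand-in for whatever retrieval call a benchmark harness makes, not either company's actual code.

```python
import asyncio

async def search_memory(query: str) -> str:
    # Stand-in for a real retrieval call with ~100 ms of network latency.
    await asyncio.sleep(0.1)
    return f"results for {query!r}"

queries = ["allergies", "last ticket", "preferred tone"]

async def sequential() -> list[str]:
    # Each call waits for the previous one: ~100 ms * number of queries.
    return [await search_memory(q) for q in queries]

async def concurrent() -> list[str]:
    # All calls in flight at once: ~100 ms total.
    return await asyncio.gather(*(search_memory(q) for q in queries))

results = asyncio.run(concurrent())
```

Run one system the slow way and the other the fast way, and any latency-sensitive comparison stops meaning much. That's the core of Zep's complaint.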

And the validity of LoCoMo itself became a Reddit debate. People pointed out that slight tweaks to the experimental setup completely flip the results.

This is the kind of drama that makes you wonder what to trust. Both companies are pushing their own numbers. Both claim the other one cheated. And developers are stuck in the middle trying to figure out what actually works.

What developers actually say

On Reddit, the reactions are mixed.

Someone forked mem0 a month after launch and called it Jean Memory. They got 300 users. Paying customers. The fork became more popular than most standalone projects.

Why fork it? The creator said they were frustrated that tools like Cursor and Claude didn't remember context. They wanted something open and cross-platform. Something you could self-host.

The initial launch had server failures and a confusing interface.

People dropped off. But then someone made a video explaining the tool and signups jumped. That tells you the idea resonates. The execution needed work.

In another thread, developers are asking whether to use mem0 or just build their own memory system. One person said they're already running a ChatGPT-style assistant and experimenting with a unified memory API.

That's the thing about infrastructure tools. If you're technical enough, you can roll your own. The question is whether the time saved is worth the integration headaches.

The GitHub issues

I looked at the open issues on mem0's GitHub. There are bugs. FAISS delete functions not working. 500 errors on memory endpoints. Validation errors with vector stores.

Standard stuff for an open-source project. But it reminds you this isn't magic. It's code. Code breaks.

One issue caught my eye. Someone tried to reproduce the results from the research paper and couldn't. Their numbers were worse than what the paper reported.

Makes you wonder how much of this is production-ready versus research-grade optimism.

The random observation about naming

Nobody talks about how weird it is that everything in AI infrastructure has a name that sounds like a typo.

Mem0. Zep. Letta. MemSync. It's like everyone decided vowels are expensive and just started dropping them.

I get it. Short domains. Easy to remember. But when you're trying to Google an error message and you type "mem0 vector store issue" it feels like you're searching in a different language.

And don't get me started on projects named after animals. There's a whole category of developer tools that sound like rejected Pokémon. This isn't one of them but I've noticed the pattern.

Who this isn't for

Most people don't need this.

If you're building a simple chatbot that resets every conversation, you're fine. If your AI assistant doesn't need to remember user preferences across sessions, save yourself the complexity.

Memory optimization is for scale. For apps with thousands of users having ongoing conversations. For systems where context matters more than starting fresh.

And even then, you need to ask if three lines of code is really three lines. Because production deployments mean Docker configs. Kubernetes setups. Monitoring. Security compliance.

Mem0 claims SOC 2 and HIPAA compliance. That's good if you're in healthcare or fintech. But it also means you're dealing with enterprise-grade infrastructure. Not everyone needs that weight.

This is overkill for small projects.

If your app doesn't justify the token savings, skip it. Use a simpler approach. Store chat history in a database. Let the LLM re-read it. Sometimes the naive solution is good enough.
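
For what that naive version looks like, here's a minimal sketch with SQLite. The table layout and prompt assembly are my assumptions, not a recommended pattern.

```python
import sqlite3

conn = sqlite3.connect("chat_history.db")
conn.execute("""CREATE TABLE IF NOT EXISTS messages (
    user_id TEXT, role TEXT, content TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_message(user_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)",
        (user_id, role, content))
    conn.commit()

def build_prompt(user_id: str, question: str, limit: int = 50) -> str:
    # Re-read the last N turns and prepend them to the new question.
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE user_id = ? "
        "ORDER BY rowid DESC LIMIT ?", (user_id, limit)).fetchall()
    history = "\n".join(f"{role}: {content}" for role, content in reversed(rows))
    return f"{history}\nuser: {question}"
```

No extraction, no graph, no vendor. It burns more tokens than a real memory layer, but for a small app that trade is often fine.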

What actually matters

The debate about benchmarks is loud. But I think it misses the point.

Mem0 has 41,000 GitHub stars and 14 million downloads. That doesn't happen by accident. Developers are clearly interested in solving the memory problem.

Whether mem0 is 26% better or 10% worse than Zep on some academic benchmark doesn't change the fact that both exist because LLMs suck at remembering things.

The real question is whether adding a memory layer actually improves your user experience. Does your chatbot feel smarter? Do your users notice? Are you saving enough on token costs to justify the integration time?

For some teams, the answer is yes. The Jean Memory fork proved that. People want their AI tools to remember. They're willing to pay for it.

For others, it's a solution in search of a problem. More infrastructure to maintain. More points of failure. More vendor lock-in if you go with the hosted version.

I still think about that Reddit post where someone asked if anyone was using mem0 versus rolling their own. The replies were split. Half the people loved it. Half the people said they'd rather control their own memory architecture.

That split tells you everything. This isn't an obvious win. It's a tradeoff. And you need to know your use case before you commit.
