Kimi K2 Thinking Beat GPT-4 and No One's Talking About It

Dishant Sharma
Dec 5th, 2025
6 min read

GPT-4.1 scores 54.6% on SWE-Bench Verified. It's the benchmark where models actually fix real GitHub issues. Not toy problems. Real bugs from real repos.

Kimi K2 scored 65.8%.

An open-source model just beat OpenAI's flagship on the hardest coding test we have. And no one's talking about it the way they should be.

K2 isn't better because it has more parameters. It's better because it was trained to think differently. Most LLMs think first, then act. K2 thinks while acting. Called "interleaved thinking." Sounds fancy. It's not.

Here's what it means. You give GPT a task. It thinks. Writes a plan. Calls a tool. Done thinking.

K2 calls a tool. Thinks about the result. Calls another tool based on what it learned. Thinks again. Keeps going. It can do 200 to 300 tool calls in one session. Each step builds on the last. No reset button.
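Here's a minimal sketch of that loop from the caller's side. It assumes an OpenAI-compatible chat API with tool calling; the base_url, model name, and the `run_search` tool are placeholders, not K2's actual interface.

```python
# Minimal sketch of an interleaved think -> act -> observe loop.
# Assumes an OpenAI-compatible endpoint; base_url, model name, and the
# run_search tool below are placeholders, not K2's real interface.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

def run_search(query: str) -> str:
    """Placeholder tool: swap in real web search, a code runner, whatever."""
    return f"results for {query!r}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_search",
        "description": "Search the web and return raw results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Find the bug and propose a fix."}]

for _ in range(300):  # K2 reportedly sustains hundreds of calls in one session
    reply = client.chat.completions.create(
        model="kimi-k2-thinking", messages=messages, tools=TOOLS
    )
    msg = reply.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:          # the model decided it's done
        print(msg.content)
        break
    for call in msg.tool_calls:     # run each tool, feed the result back in
        args = json.loads(call.function.arguments)
        result = run_search(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
```

The loop itself is boring. The point is what happens inside the model between iterations: it reasons over the last result before deciding on the next call.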

And people are noticing. One developer on Reddit said the writing quality is impressive. Another called it "funny and great". But buried in a thread about temperature settings, someone pointed out what matters. The interleaved reasoning. That's the part everyone overlooks.

Why This Actually Matters

With most models, you steer behavior through system instructions. Tell it what to do. How to think. What format to use.

But you're still guessing. You don't know what tools it needs until it runs. Then you add them. Then it fails. Then you add more instructions.

K2 was trained on agentic data from the start. Not chat logs. Not Q&A pairs. Multi-step problems where the model had to use tools to survive. Then they ran reinforcement learning. Made it practice planning and executing.

The result? It doesn't need you to explain how to be an agent. It already is one.

I used to think tool calling was about function schemas. Define your tools. Pass them in. The model picks one. You run it. Done.

Wrong.

That works for one tool call. Maybe two. But when you need a model to search the web, parse results, search again based on what it found, then write code using those results? The schema isn't the problem. The model is.

Most LLMs lose the thread after three tool calls. They forget what they were doing. Or they call the same tool twice with the same params. Or they just give up and hallucinate an answer.

K2 doesn't. Because it's not just calling tools. It's reflecting on them.
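If you've never hit that wall, here's the kind of duct tape you end up writing around weaker tool-callers. A hypothetical helper for illustration, not anything K2 needs or ships:

```python
# The guardrail you end up bolting onto models that lose the thread:
# drop exact-duplicate tool calls and cap the turn budget.
# Hypothetical helper code, purely illustrative.
def should_execute(call_log: list[tuple[str, str]], name: str, args_json: str,
                   max_calls: int = 25) -> bool:
    """Return False if this exact call was already made or the budget is spent."""
    if len(call_log) >= max_calls:
        return False
    if (name, args_json) in call_log:   # same tool, same params: the model is looping
        return False
    call_log.append((name, args_json))
    return True
```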

What Reflection Actually Looks Like

Here's a question people always ask. What's the difference between chain-of-thought and interleaved thinking?

Chain-of-thought is one long monologue. The model thinks. Writes steps. Arrives at an answer. If step 2 was wrong, too bad. The whole chain is tainted.

Interleaved thinking pauses. The model thinks. Acts. Sees the result. Thinks again based on reality. Not based on its guess about reality.

One bad step doesn't spoil the batch.

There's a video where someone asks whether K2 keeps reasoning across multiple turns or just one. Good question. The answer matters.

It keeps reasoning within a session. Not forever. But for as long as the task needs. Could be 10 tool calls. Could be 200.

And it doesn't just stack results. It verifies them. Checks if its own logic makes sense. Rolls back if it doesn't. Self-correction without you asking for it.

Most tutorials tell you to add "verify your work" to your prompt. K2 was trained to do that by default.
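To make "verify and roll back" concrete: if you had to wire that pattern up around an ordinary model yourself, it would look roughly like this. A hand-wavy sketch of the idea, not how K2 implements it internally, and `attempt_step` / `verify` are placeholder names.

```python
# Rough sketch of a verify-then-rollback step, the pattern K2 was trained
# to do on its own. attempt_step and verify are hypothetical stand-ins.
import copy

def attempt_step(state: dict) -> dict:
    """Placeholder: one tool call plus whatever it changes in working state."""
    new_state = copy.deepcopy(state)
    new_state["steps"] = state.get("steps", 0) + 1
    return new_state

def verify(state: dict) -> bool:
    """Placeholder check: does the result still hold up? (e.g. tests pass)"""
    return True

state = {"steps": 0}
for _ in range(10):
    checkpoint = copy.deepcopy(state)   # remember where we were
    state = attempt_step(state)
    if not verify(state):               # logic doesn't hold: roll back, try again
        state = checkpoint
```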

Training Makes the Agent

The first time I read about K2's training, I missed the important part. They used 15.5 trillion tokens. Big number. Everyone talks about it.

But that's pretraining. The interesting stuff happened after.

They built a pipeline to generate agentic data. Tasks where the model had to search, plan, code, test, debug. Then they fed that data back into training.

Then came RL. Not RL to make it sound friendly. RL to make it survive real environments. Synthetic and real.

What actually happens is this. The model learns that tool calls have consequences. Bad tool choice? Task fails. Good tool choice? Get closer to the answer. Do that a few million times. You get a model that plans ahead.

It's not magic. It's training on the right data with the right incentives.
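A toy version of that incentive structure, just to show its shape. This is not Moonshot's pipeline; `policy` and `env` are abstract stand-ins, and the reward is simply "did the whole episode end in a passing state."

```python
# Toy sketch of the incentive: reward depends on whether the multi-step
# episode actually succeeded, not on how plausible the text sounded.
# Illustrative only, not Moonshot's training code.
def run_episode(policy, env) -> float:
    """Let the policy plan and call tools until the environment says it's done."""
    obs = env.reset()
    done = False
    while not done:
        action = policy.act(obs)       # think, pick a tool, pick arguments
        obs, done = env.step(action)   # the tool call has real consequences
    return 1.0 if env.task_solved() else 0.0   # the only signal that matters

# Repeat across millions of episodes, nudge the policy toward high-reward
# trajectories, and "plan ahead" stops being a prompt and becomes a habit.
```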

My coworker tried to build an agent last month. Used GPT-4 with tool calling. Spent two weeks writing system prompts. "Always verify." "Use tools in this order." "If X happens, do Y."

It worked 60% of the time. The other 40%? The model just ignored the instructions. Or followed them too literally and got stuck.

You can't prompt your way out of bad training.

The Part No One Mentions

K2 is a Mixture of Experts model. 1 trillion total parameters. Only 32 billion active per token.

Sounds technical. Here's why it matters.

You can run this locally. Not on a gaming laptop. But on a decent machine with quantization? Yeah.
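Rough arithmetic on why the MoE split matters. Back-of-envelope numbers, not official sizing guidance:

```python
# Back-of-envelope memory math for a 1T-total / 32B-active MoE model.
# Illustrative only; real deployments add KV cache, activations, overhead.
total_params  = 1.0e12   # all experts combined
active_params = 32e9     # parameters actually used per token

bytes_per_param_4bit = 0.5   # ~4-bit quantization

weights_gb = total_params * bytes_per_param_4bit / 1e9
active_fraction = active_params / total_params

print(f"~{weights_gb:.0f} GB of weights at 4-bit")            # ~500 GB to hold
print(f"only {active_fraction:.1%} of parameters per token")  # ~3.2% to compute
```

Storage is still heavy, but per-token compute scales with the 32 billion active parameters, not the full trillion. That's the difference between "impossible" and "a serious workstation."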

Most "state-of-the-art" models are stuck in APIs. You rent them. You don't own them. You can't see what they're doing.

K2 is fully open. Weights. Code. Training details. You can load it. Run it. Break it. Fix it.

And the benchmark numbers aren't cherry-picked. 66.1 on Tau2-Bench. 76.5 on ACEBench. 47.3 on SWE-Bench Multilingual. Beats most closed models.

On coding specifically? 53.7 on LiveCodeBench. Without "extended thinking" mode. Just the base model deciding when to think and when to act.

That's better than models three times its size.

The Thing About Naming Models

Quick tangent. Why do Chinese labs always pick names like "Kimi" and "Moonshot"?

Western labs do Zeus. Titan. Olympus. Names that sound like they'll conquer you.

Chinese labs do Kimi. Sounds like a friend. Or a cartoon character.

Maybe it's cultural. Maybe it's branding. But it makes these models feel less intimidating. Which is weird. Because K2 is probably the most capable open agent model right now.

You'd think they'd name it something that sounds tougher. Instead they named it like a pet.

Who This Isn't For

Most people don't need this.

If you're building a chatbot, use something smaller. K2 is overkill.

If you need one or two tool calls, GPT-4 is fine. Faster. Cheaper per token for simple tasks.

K2 is for the edge cases. The tasks where you need 50 tool calls. Or 100. Where you need the model to search, fail, search again, code, test, fail, debug, and finally work.

Software engineering agents. Research assistants that browse the web. Complex automation that doesn't just follow a script.

And here's the thing no one says. K2 burns through reasoning tokens. 130 million tokens on complex tasks compared to GPT's 82 million. K2's tokens are cheaper per unit. But you're using way more of them.

If you're not careful, it costs more than GPT.
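Quick math on how that plays out. The prices below are placeholders, not current rate cards; the point is how token volume eats into a per-token discount:

```python
# Placeholder prices, only to show how token volume offsets a per-token discount.
k2_tokens,  k2_price_per_m  = 130e6, 2.50    # hypothetical $ per 1M output tokens
gpt_tokens, gpt_price_per_m = 82e6, 10.00    # hypothetical $ per 1M output tokens

k2_cost  = k2_tokens  / 1e6 * k2_price_per_m     # $325
gpt_cost = gpt_tokens / 1e6 * gpt_price_per_m    # $820

print(f"K2:  ${k2_cost:,.0f}")
print(f"GPT: ${gpt_cost:,.0f}")
# Shrink the price gap, or grow the token gap, and the comparison flips.
```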

There's also this. Some Reddit users say the temperature is too high out of the box. Words are grammatically correct but lack coherence. You have to tune it.
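If you hit that, the fix is one sampling parameter, wherever you're calling it from. The endpoint, model name, and the 0.6 value here are placeholders, not a recommended setting:

```python
# Lowering sampling temperature on an OpenAI-compatible endpoint.
# base_url, api_key, model name, and the 0.6 value are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")
reply = client.chat.completions.create(
    model="kimi-k2-thinking",
    messages=[{"role": "user", "content": "Summarize this stack trace."}],
    temperature=0.6,   # lower than the default if outputs ramble
)
print(reply.choices[0].message.content)
```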

It's not plug-and-play for every use case.

What Sticks With Me

The SWE-Bench number is what I keep coming back to. 65.8%.

That's not a party trick. That's a model fixing real bugs in real codebases. With tools. Without a human holding its hand.

Two years ago that number was 10%. One year ago it was 30%. Now it's 65%. And it's open source.

Makes you wonder what next year looks like. Or next month.
