Xiaomi's MiMo-V2-Flash: A 309B Model That Nobody Saw Coming

Dishant Sharma
Dec 21st, 2025
6 min read

Xiaomi just dropped a 309 billion parameter AI model that nobody saw coming. People on Reddit woke up to find a phone manufacturer competing with OpenAI and Anthropic. The model is called MiMo-V2-Flash. It's open source. And it's fast.​

The specs sound ridiculous at first. 309 billion total parameters, but only 15 billion active for any given token. It runs at 150 tokens per second. That's faster than Claude Sonnet 4.5 according to their efficiency charts. And the pricing makes you double-check the decimal point. Ten cents per million input tokens. Thirty cents per million output tokens.
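
To make the pricing concrete, here's a quick back-of-the-envelope calculation. The traffic numbers below are made up for illustration; only the per-million-token prices come from the listings above.

```python
# Quick cost estimate at the quoted prices. The request volume and token counts
# are invented for illustration -- only the per-million-token prices are real.
INPUT_PRICE_PER_M = 0.10   # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.30  # USD per million output tokens

def monthly_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    """Total USD for a month of traffic with fixed per-request token counts."""
    cost_in = requests * input_tokens / 1e6 * INPUT_PRICE_PER_M
    cost_out = requests * output_tokens / 1e6 * OUTPUT_PRICE_PER_M
    return cost_in + cost_out

# 100k requests a month, 2,000 prompt tokens and 500 completion tokens each:
print(f"${monthly_cost(100_000, 2_000, 500):.2f}")  # -> $35.00
```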

You've probably seen this pattern before. Big tech company releases model. Everyone gets excited. Then reality hits. But this one feels different. The community reaction isn't hype. It's confusion mixed with genuine interest.​

Why a phone company built this

Xiaomi makes smartphones and rice cookers. Not exactly the first place you'd look for frontier AI models.

But they've been quietly building an AI team. The model uses a Mixture-of-Experts (MoE) architecture. Think of it like having 309 billion parameters sitting there, but you only wake up 15 billion of them for each token. The rest stay dormant. This is how they get the speed.
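
Here's a toy sketch of that idea in PyTorch. It is not Xiaomi's actual layer and the sizes are made up; it just shows how a router picks a few experts per token so most of the parameters never run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy Mixture-of-Experts layer: a router scores experts per token and only the
# top-k selected experts do any work. Sizes are illustrative, not MiMo's config.
class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                                        # x: (num_tokens, d_model)
        scores, idx = self.router(x).topk(self.top_k, dim=-1)    # pick top-k experts per token
        weights = F.softmax(scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[:, slot] == e                        # tokens routed to expert e
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        return out

layer = ToyMoE()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64]) -- only 2 of 16 experts ran per token
```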

The interesting part is the attention mechanism. Most models use full attention or sliding windows of 2048 to 4096 tokens. Xiaomi went with 128 tokens. That's aggressive. Almost reckless. They compensate with something called "attention sink bias" that keeps long-context performance from falling apart.​
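
Here's a minimal sketch of what that combination can look like as an attention mask: each token sees only the most recent 128 positions, plus a handful of always-visible positions at the start that act as attention sinks. The sink count is my assumption for illustration, and Xiaomi's "attention sink bias" may be a learned bias rather than a hard mask like this.

```python
import torch

# Sliding-window causal mask with a few always-visible "sink" positions at the
# start of the sequence. Window size comes from the article; the sink count is
# an assumption for illustration.
def sliding_window_mask(seq_len: int, window: int = 128, n_sinks: int = 4) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # never attend to the future
    in_window = (i - j) < window             # only the most recent `window` tokens
    is_sink = j < n_sinks                    # first few tokens stay visible to everyone
    return causal & (in_window | is_sink)    # True = attention allowed

mask = sliding_window_mask(1024)
print(mask.sum(dim=1)[:5], mask.sum(dim=1)[-1])  # visible keys per query row
```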

They trained this thing on 27 trillion tokens using FP8 precision.

The context window supports 256k tokens. That's not record-breaking but it's solid. You can run hundreds of agent interactions without losing track.​

The multi-token prediction trick

Here's where it gets technical. And weird.

Traditional language models predict one token at a time. Next word. Next word. Next word. MiMo-V2-Flash predicts multiple tokens ahead. They built a lightweight module with only 0.33 billion parameters per block. It uses dense feed-forward networks instead of the MoE architecture.​

This can roughly triple output speed during inference. But here's the catch. The module uses sliding window attention instead of global attention to keep its overhead low. Most tutorials tell you this would hurt accuracy. Somehow it doesn't tank performance here.
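
If you squint, the inference-time loop looks a lot like draft-and-verify speculative decoding. Here's a rough sketch of that general pattern; the two helper functions are placeholders you'd have to supply, and MiMo's actual machinery (sampling consistency, skipping re-computation) isn't reproduced here.

```python
# Rough sketch of the draft-and-verify pattern behind MTP-style decoding.
# `draft_step` stands in for the cheap 0.33B module proposing k tokens ahead;
# `verify_step` stands in for the full model checking those positions (plus one
# extra token) in a single pass. Both are placeholder callables, not real APIs.
def decode_with_mtp(verify_step, draft_step, prompt, max_new=64, k=3):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = draft_step(tokens, k)            # k cheap guesses for the next tokens
        checked = verify_step(tokens, draft)     # full model's answers for k+1 positions
        accepted = 0
        for guess, truth in zip(draft, checked):
            if guess == truth:
                accepted += 1                    # keep the prefix the full model agrees with
            else:
                break
        tokens += checked[:accepted + 1]         # always commit at least one verified token
    return tokens
```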

I used to think multi-token prediction was just speculative decoding with a new name. Turns out there's more to it. The system maintains sampling consistency across turns and avoids re-computation. That matters for agent workflows where the model needs to stay coherent over long conversations.​

The average acceptance length sits between 2.8 and 3.6 tokens per pass with a 3-layer MTP setup. Effective speedup lands at 2.0 to 2.6 times the speed of traditional one-token-at-a-time generation.
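
A quick sanity check on how an acceptance length turns into a speedup. The acceptance numbers are the ones above; the draft-head overhead is an assumption, not a published figure.

```python
# If each full-model pass commits `accepted` tokens on average, and the extra MTP
# work costs `overhead` as a fraction of a full forward pass, the speedup over
# one-token-at-a-time decoding is roughly accepted / (1 + overhead).
def effective_speedup(accepted: float, overhead: float = 0.35) -> float:
    return accepted / (1.0 + overhead)

for accepted in (2.8, 3.6):
    print(f"{accepted} tokens per pass -> ~{effective_speedup(accepted):.1f}x")
# ~2.1x and ~2.7x, in the same ballpark as the quoted 2.0-2.6x range
```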

Benchmark wars

MiMo-V2-Flash scored 94.1 on AIME 2025. That's close to Kimi K2's 94.5. On SWE-Bench Verified, it hit 73.4 percent, beating all open-source competitors. LiveCodeBench v6 shows 30.8 percent. MMLU-Pro sits at 73.2.

Numbers don't tell the full story though. Performance varies by task type. The model does well on reasoning and coding benchmarks. Writing quality is competitive but not groundbreaking.​

Long-context evaluations show it beating Kimi K2 Thinking despite K2 being a larger model with full global attention. That hybrid sliding window attention architecture actually works.​

But look. Benchmarks are benchmarks. Real-world use is different. The question is whether this thing hallucinates less or handles edge cases better than models with similar scores.

The agentic focus

Xiaomi trained this model specifically for agent workflows. That's coding tasks, tool calling, multi-turn conversations.​

They built something called Multi-Teacher On-Policy Distillation combined with large-scale agentic reinforcement learning. The technical paper doesn't explain exactly how that works. But the results show up in agent benchmarks.​

The model supports what they call "hybrid thinking mode". You can toggle whether it thinks through a problem or answers instantly. Useful when you want raw speed versus when you need deeper reasoning.
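
Here's what toggling that could look like through an OpenAI-compatible endpoint. The base URL, the model string, and the `enable_thinking` flag are all assumptions for illustration; whichever provider you use will document the real parameter.

```python
# Hypothetical call through an OpenAI-compatible API. The endpoint, model name,
# and `enable_thinking` flag are illustrative guesses -- check your provider's docs.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Refactor this service to use a message queue."}],
    extra_body={"enable_thinking": True},  # hypothetical toggle: slower, deeper reasoning
)
print(response.choices[0].message.content)
```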

It generates functional HTML webpages in one shot. Works with vibe-coding tools like Cursor and Cline. The 256k context window means it can maintain state across hundreds of tool calls without forgetting what it's doing.​

Infrastructure details

The deployment uses SGLang for optimization. They built a two-layer toolbox with Ray actor pools to handle resource contention. Cold-start delays get eliminated through persistent tool execution.​
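
Here's the general shape of that persistent-execution idea with a Ray actor pool. It's a pattern sketch, not Xiaomi's toolbox: long-lived workers keep their state warm, so each tool call reuses a running process instead of paying a cold start.

```python
import ray

# Persistent tool workers as long-lived Ray actors: a small warm pool handles
# tool calls instead of spawning a fresh process per call. Pattern sketch only.
ray.init(ignore_reinit_error=True)

@ray.remote
class ToolWorker:
    def __init__(self):
        self.sandbox = {}            # stand-in for an interpreter/session that stays warm

    def run(self, code: str):
        exec(code, self.sandbox)     # state persists across calls to the same actor
        return self.sandbox.get("result")

pool = [ToolWorker.remote() for _ in range(4)]   # small persistent pool

# Round-robin requests over warm workers instead of cold-starting each time.
futures = [pool[i % len(pool)].run.remote(f"result = {i} ** 2") for i in range(8)]
print(ray.get(futures))              # [0, 1, 4, 9, 16, 25, 36, 49]
```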

They also added a fine-grained data scheduler that works at the sequence level instead of micro-batches. Combined with partial rollout, this cuts GPU idleness from long-tail stragglers.​

Random thought about phone companies

It's strange watching Xiaomi compete in AI. They make budget smartphones that work surprisingly well. Everyone knows them for copying Apple's design language while undercutting on price.

And now they're training models on 27 trillion tokens. Building infrastructure for agent workflows. Publishing open-source weights on Hugging Face.​

The pattern repeats across Chinese tech companies. DeepSeek released V3. Moonshot AI built Kimi K2. ByteDance has their models. All open source or near-open. All competitive with Western frontier labs.​

You start wondering if hardware companies have an edge. They understand inference at scale. They know power efficiency. They've been optimizing chips for years. Maybe that knowledge transfers to model architecture decisions like aggressive sliding windows and hybrid attention patterns.

Who shouldn't bother

This model is overkill for most use cases. If you're building a simple chatbot or doing basic text generation, you don't need 309 billion parameters.​

The aggressive 128-token sliding window means certain tasks might not work well. If your use case needs consistent full attention across long sequences, this architecture fights against you. The attention sink bias helps but it's a workaround, not a solution.​

Chinese language performance is strong but not as good as Kimi K2. If you're working primarily in Chinese, that matters. CMMLU scores show 87.4 for MiMo versus 90.9 for K2.​

And honestly, the speed claims need real-world testing. Benchmarks in controlled environments are one thing. Production workloads with variable request patterns are different. The 150 tokens per second might not hold up under load.​

What this means

Xiaomi released this model December 16, 2025. People expected announcements from Google and OpenAI that week. Instead a phone manufacturer jumped into the race.​

The model is available on Hugging Face right now. Weights are open. You can download and run it locally if you have the hardware. Or use it through API providers at those ridiculous prices.​

Community response on Reddit ranges from skeptical to impressed. Some users immediately started comparing it to Claude Sonnet 4.5. Others are waiting to see real-world performance before getting excited.​

The technical architecture is genuinely interesting though. That 128-token sliding window shouldn't work this well. The multi-token prediction speedup is significant. And the agent-focused training shows a clear use case beyond general chat.​

Makes you wonder what other hardware companies are training in secret. Samsung? LG? Huawei? The barrier to entry keeps dropping. Training costs come down. Architectural innovations spread fast through research papers.

MiMo-V2-Flash might not be the best model out there. But it's another data point showing the field is wide open. And that's worth paying attention to.
