Local LLMs Are Actually Useful Now


Someone on Reddit spent three years avoiding local LLMs because they thought the tech wasn't there yet. Then they ran Qwen3 on a CPU and got 3 tokens per second. Three years of API bills when they could have been running models on hardware they already owned.
That's the thing about local LLMs right now. They work. Not in a "maybe someday" way. They work today.
Why people stopped waiting
A guy posted on r/LocalLLaMA in April asking why anyone bothers with local models in 2025. The answers weren't about hobbyists tinkering. They were about real problems.
Privacy came up first. Not the abstract "I value my privacy" kind. The "I can't send client data to OpenAI's servers" kind. One person said they wanted to fine-tune models on their own data without worrying who sees it.
Then cost. API bills add up fast when you're running agents that make hundreds of calls. Local models cost little beyond electricity once you own the hardware. One analysis found businesses cut costs by 75% over time by hosting locally.
But here's what surprised me. Speed.
Someone reported Qwen3 30B running faster than API calls for simple tasks.
No network latency. No rate limits. Just instant responses for things like fixing grammar or answering quick questions.
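To make that concrete, here's a minimal sketch of that kind of quick local task in Python, hitting Ollama's local REST API. Everything here is an assumption for illustration: it presumes Ollama is running on its default port and the model tag is a placeholder for whatever you actually have pulled.

```python
# Minimal sketch of a quick local task: fix grammar via a local Ollama server.
# Assumes Ollama is running on its default port (11434) and a model is already
# pulled; "qwen3:30b" is a placeholder tag, swap in whatever you actually run.
import requests

def fix_grammar(text: str, model: str = "qwen3:30b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Fix the grammar. Reply with only the corrected text:\n" + text,
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

print(fix_grammar("their going to the store tomorow"))
```

The whole round trip stays on localhost, which is the point: no network hop, no rate limiter, no queue.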
The models that actually work
Qwen3 keeps coming up. It's the model people mention when they want something that just works.
The 30B model runs on CPUs and hits 3 tokens per second. That sounds slow. It's not. For most tasks, that's close to the pace you'd read the output at anyway. And on a GPU? People report 10 to 15 tokens per second on mid-range cards.
Llama models are faster for complex tasks like coding. About 3x faster than Qwen on those workloads. But Qwen wins on accuracy.
Phi-3 and Mistral show up for specialized uses. Phi-3 is small enough to run on a laptop but still handles real work. Mistral excels at catching positive cases but throws more false positives.
Here's what broke for me the first time: I tried running a 70B model on 8GB of VRAM. It loaded. Barely. Then crawled at 0.5 tokens per second. Unusable.
The fix was obvious in hindsight. Use a smaller model or quantize it. The Q4_K_M quantized version of Qwen3 runs at 26 tokens per second on the same hardware.
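If you want to verify numbers like that on your own hardware, Ollama's API reports how many tokens it generated and how long generation took, so a rough throughput check is a few lines of Python. The quantized tag below is a placeholder, not a specific recommendation.

```python
# Rough throughput check against a local Ollama model. The /api/generate
# response includes eval_count (tokens generated) and eval_duration
# (nanoseconds), so tokens per second falls out directly.
# "qwen3:30b-q4_K_M" is a placeholder for whatever quantized build you pulled.
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(tokens_per_second("qwen3:30b-q4_K_M", "Explain a KV cache in two sentences."))
```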
Why specialized agents matter now
Most people don't need one giant model anymore. They need three small ones that are good at specific things.
One for customer support. One for data processing. One for code review.
Microsoft calls these "orchestrated agents" and built a whole service around the idea. But you don't need Azure to do this. You can run multiple local models and route tasks based on what each does best.
Think about it like this: you wouldn't hire one person to do sales, support, and engineering. Why use one model for everything?
Someone built a system where one agent summarizes documents, another extracts structured data, and a third handles user questions. Each model is small. Each is fast at its job. Together they handle workflows that would cost hundreds monthly in API calls.
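A minimal sketch of that kind of routing, again against a local Ollama server. The task names and model tags here are made up for illustration; the idea is just a dictionary that maps a task type to whichever small model handles it best.

```python
# Toy router: send each task type to a different small local model.
# Model tags are placeholders; point them at whatever you actually run.
import requests

AGENTS = {
    "summarize": "qwen3:8b",    # document summaries
    "extract": "phi3:mini",     # pull structured fields out of text
    "answer": "mistral:7b",     # user-facing Q&A
}

def run_agent(task: str, text: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": AGENTS[task], "prompt": text, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(run_agent("summarize", "Summarize: the quarterly report shows ..."))
```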
When it's annoying
Look, local LLMs aren't perfect. The setup is annoying.
You need to understand model formats. GGUF, GPTQ, AWQ. You need to pick quantization levels. You need to configure context windows and KV cache settings.
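Here's roughly what those knobs look like with llama-cpp-python, as one example. The GGUF path and the numbers are illustrative, not recommendations; the point is that every one of them is a decision you now own.

```python
# Sketch of the knobs you end up tuning with llama-cpp-python.
# The file path and values below are illustrative only.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3-30b-Q4_K_M.gguf",  # format + quantization baked into the file
    n_ctx=8192,       # context window; a bigger window means a bigger KV cache in memory
    n_gpu_layers=32,  # layers offloaded to the GPU; 0 keeps everything on the CPU
    n_threads=8,      # CPU threads for whatever stays on the CPU
)

out = llm("Rewrite this in plain English: heretofore the party of the first part...", max_tokens=128)
print(out["choices"][0]["text"])
```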
One person said they could have made more progress just using APIs and focusing on their actual product. They're not wrong.
And performance still lags behind GPT-4 or Claude for complex reasoning. If you need cutting-edge performance, APIs win.
Here's the honest take: most people don't need this. If you're building a simple chatbot or doing light text work, just use an API. Local models make sense when privacy matters, costs scale badly, or you need fine control.
That coworker who runs everything local
I know someone who runs local models for everything. Email drafts. Code suggestions. Even grocery list optimization.
He showed me his electricity bill once. Running models 24/7 costs him about 30 dollars monthly. His old ChatGPT Plus subscription was 20.
I didn't have the heart to tell him he's paying more to run worse models. But he's happy. And his data never leaves his network. That matters to him more than the 10 dollar difference.
What actually changed
The models got smaller without getting dumber. That's the real shift.
Two years ago you needed enterprise hardware to run anything useful. Now Phi-3 runs on phones. Qwen3 runs on CPUs. Decent performance on hardware most developers already own.
Quantization got better too. The quality gap between full precision and 4-bit quantization is barely noticeable for most tasks. But the speed and memory differences are massive.
And tools improved. Ollama makes running models stupidly simple. One command and you're generating text. No configuration files. No compilation. It just works.
The frameworks for building agents matured. LangChain, AutoGPT, CrewAI. You can orchestrate multiple models without writing everything from scratch.
Real numbers
Someone tested Qwen3 performance across different setups. On a 3090 GPU: 65 tokens per second with Ollama, 75 with llama.cpp. After patches: 125 tokens per second.
That's fast enough for real-time applications. Fast enough for agents that need to make dozens of calls.
Another person runs Qwen on a Ryzen 5 3600 box with an 8GB graphics card and gets 10 to 15 tokens per second. That's a 300 dollar used GPU.
The cost savings are real. One business analysis found local hosting cuts costs 75% long-term versus API subscriptions. Initial hardware investment pays back in months if you're running high-volume workflows.
Who's actually using this
Privacy-conscious businesses went first. Law firms, healthcare providers, anyone handling sensitive data.
Then small teams who needed to control costs. Running agents 24/7 on APIs gets expensive fast. Local models cost the same whether you make 10 calls or 10,000.
Now developers are building products on local models. Personal knowledge management tools. Data analysis agents. Task automation systems.
Industries like retail use local LLMs for region-specific marketing while keeping customer data on-site. Logistics companies integrate them with data lakes to optimize supply chains.
The part that matters
I spent months thinking local LLMs were a future thing. Something to watch but not use yet.
Then I tried Qwen3. It answered coding questions fast enough that I stopped switching to ChatGPT. Not because it was better. Because it was right there. No context switching. No waiting for API responses.
That's what changed. Local models crossed the "good enough" threshold. They're not the best at everything. But they're good enough for most things. And they're only getting better.