The 90/10 AI Stack: Open-Source Models for Daily Work, Frontier for the Hard Problems
Open-source models like Qwen3.6-35B-A3B, gemma-4-31B-it, Qwen3-Coder-Next, and DeepSeek 4.0 have crossed the daily-workload quality bar at a fraction of frontier-model cost. The practical move is a specialized router that sends the easy 90% to open-source and reserves frontier models for the work that truly needs them.
For most of 2024 and 2025, the LLM conversation was one-way. Bigger models. Better benchmarks. New frontier capabilities. Most of those headlines came from two companies.
In 2026, the conversation has split in two — and that split changes the economics of every AI decision your engineering organization will make this year.
The Inversion Nobody Announced
Look at what has shipped in just the last few months:
- Qwen3.6-35B-A3B — a Mixture-of-Experts model with 35B total parameters but only 3B activated per token. It runs on a single workstation-class GPU and matches the daily-task quality of models that cost an order of magnitude more per call.
- gemma-4-31B-it — Google's instruction-tuned variant that closed the gap on chat, summarization, and structured extraction, under a license you can actually deploy.
- Qwen3-Coder-Next — code completion and refactor quality that is, for the workloads our engineers actually run every day, effectively indistinguishable from the top-tier hosted coding models.
- DeepSeek 4.0 — the latest serious open entry, very close to the daily-need bar across reasoning, code, and chat.
None of these were positioned as frontier. That is the point. They were positioned as good enough for the job, with margin to spare. And that is exactly what most of our daily engineering workload needs.
Where Anthropic and OpenAI Are Going
Meanwhile, the frontier labs have not slowed down — they have changed direction. The newest releases from Anthropic and OpenAI are explicitly higher-order: longer reasoning chains, deeper tool use, larger contexts, more autonomous agentic loops. Harder math, harder code, harder multimodal.
That capability is real and it is expensive. Per-million-token prices on the top tiers have moved up, not down. That is defensible — what those models can do in a single call is genuinely different from a year ago.
But here is the catch: most of the work inside a typical engineering organization does not need frontier intelligence.
Classification. Extraction. Summarization. Boilerplate generation. Diff explanation. Standard documentation. Ticket triage. Internal Q&A. Most retrieval-augmented chat. Most validation. Most transformation.
For all of that, the open-source models listed above are not slightly cheaper. They are drastically cheaper — often by an order of magnitude when you self-host, and still 5–10x cheaper through managed open-model providers.
The Cost Gap Is Bigger Than the Sticker Price
Headline per-token pricing only tells part of the story. The real gap shows up when you measure full workload economics:
- Token efficiency. Frontier models often reason longer per answer. That is great for hard problems and wasteful for easy ones.
- Idle cost. A self-hosted open model on your own GPU has a flat hardware cost regardless of call volume. Past a certain throughput, the marginal cost of an inference call approaches zero.
- Data residency. When the model lives inside your VPC, you stop paying the architectural tax of redacting, masking, or routing around external endpoints for sensitive payloads.
- Latency. Smaller models, served locally, are simply faster for the large majority of tasks that do not need deep reasoning.
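A quick way to sanity-check the idle-cost point is a breakeven calculation. The numbers below are illustrative assumptions only, not real prices; substitute your own GPU amortization, power cost, and provider rates.

```python
# Back-of-the-envelope breakeven between a self-hosted open model and a
# hosted frontier API. Every number here is an assumed placeholder -- replace
# with your own GPU amortization, electricity, and provider pricing.

gpu_monthly_cost = 1_500.0                    # assumed: amortized hardware + power, USD/month
self_hosted_tokens_per_month = 2_000_000_000  # assumed sustained throughput on that box

frontier_price_per_m_tokens = 10.0            # assumed blended USD per 1M tokens

self_hosted_cost_per_m = gpu_monthly_cost / (self_hosted_tokens_per_month / 1_000_000)
breakeven_tokens = gpu_monthly_cost / frontier_price_per_m_tokens * 1_000_000

print(f"Self-hosted cost per 1M tokens: ${self_hosted_cost_per_m:.2f}")
print(f"Breakeven vs frontier API: {breakeven_tokens / 1_000_000:,.0f}M tokens/month")
```

Past the breakeven volume, each additional call on the self-hosted box is close to free; below it, a managed open-model provider is usually the cheaper path.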
We have been running this math internally at GeekyAnts, and the picture is consistent. Frontier models are the right choice for a clearly bounded set of high-leverage tasks. They are the wrong choice for almost everything else.
A Quick Map of Today's Open-Source Daily-Workload Models
| Model | What It Is | Where It Shines |
|---|---|---|
| Qwen3.6-35B-A3B | 35B params, 3B active (MoE) | Chat, RAG, structured extraction |
| gemma-4-31B-it | Google's instruction-tuned 31B | Summarization, internal Q&A, classification |
| Qwen3-Coder-Next | Code-specialized Qwen variant | IDE completion, refactor, diff explanation |
| DeepSeek 4.0 | Latest open competitor | Reasoning-lite tasks, intermediate steps in agent workflows |
None of these models is trying to win the hardest benchmark. They are each trying to win the most common workload, and that is a different and arguably more useful race.
The Specialized Router Is the Whole Game
This is where it gets interesting as an engineering problem.
If you accept that open-source models can handle 80–90% of daily tasks at a fraction of the cost, and that frontier models are still essential for the remaining 10–20% where the intelligence ceiling matters, the question becomes:
How do you route correctly, automatically, at scale?
The answer is a specialized router — a thin layer that classifies the incoming request and sends it to the right model. Not a dropdown. A real classifier with rules, signals, and feedback loops:
- Task type — code completion, structured extraction, long-context summarization, or open-ended reasoning?
- Risk tier — is the output auto-merged, auto-deployed, or auto-sent to a customer? Higher consequence, higher-tier model.
- Latency budget — is a human waiting on this in an IDE, or is it a background batch?
- Data sensitivity — does the payload contain customer data, source code, or regulated content? Route to an in-house model.
- Confidence escalation — if the small model's output fails validation or scores low on a verifier, escalate to a larger model instead of running everything on the larger one by default.
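To make the shape concrete, here is a minimal sketch of that routing layer in Python. The model aliases, task labels, and the `call_model` / `verify` hooks are placeholders for whatever your gateway and eval harness actually expose; treat it as a shape, not an implementation.

```python
from dataclasses import dataclass

# Hypothetical model aliases -- substitute whatever your gateway exposes.
OPEN_MODEL = "qwen3.6-35b-a3b"
CODE_MODEL = "qwen3-coder-next"
FRONTIER_MODEL = "frontier-reasoning-xl"  # placeholder name

@dataclass
class Request:
    task_type: str                 # e.g. "code_completion", "extraction", "open_reasoning"
    risk_tier: str                 # "low" | "high" -- auto-merged or customer-facing output?
    latency_budget_ms: int         # human waiting in an IDE, or a background batch?
    contains_sensitive_data: bool
    prompt: str

def route(req: Request) -> str:
    """Pick a model tier from the signals described above."""
    # Data sensitivity: keep regulated or proprietary payloads on in-house models.
    if req.contains_sensitive_data:
        return CODE_MODEL if req.task_type == "code_completion" else OPEN_MODEL
    # High-consequence or genuinely open-ended work goes straight to the frontier tier.
    if req.risk_tier == "high" or req.task_type == "open_reasoning":
        return FRONTIER_MODEL
    # Everything else stays on the cheap, fast tier. A fuller router would also
    # weigh latency_budget_ms; omitted here for brevity.
    return CODE_MODEL if req.task_type == "code_completion" else OPEN_MODEL

def answer(req: Request, call_model, verify) -> str:
    """Confidence escalation: try the cheap tier first, escalate only on failure.

    `call_model(model, prompt)` and `verify(req, output)` are assumed hooks
    into your gateway and eval harness, not real library functions.
    """
    model = route(req)
    output = call_model(model, req.prompt)
    if model != FRONTIER_MODEL and not verify(req, output):
        output = call_model(FRONTIER_MODEL, req.prompt)
    return output
```

The escalation path is what keeps the split honest: the default is cheap, and only a failed validation buys a frontier call.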
This is the pattern we are integrating across our engineering organization. The router becomes the single most valuable piece of AI infrastructure you own — because every percentage point of traffic you shift from frontier-tier to open-source is a direct margin improvement, multiplied across every team and every workflow.
I have written before about the gateway layer — the LiteLLM-style proxy that gives you one interface to many providers. The router sits on top of that. The gateway gives you reach. The router gives you economics.
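In practice, if the gateway speaks the OpenAI-compatible protocol (as a LiteLLM-style proxy does), the router's decision is just the model name passed on each call. A minimal sketch; the base URL, API key, and model alias are placeholders for your own deployment:

```python
from openai import OpenAI

# One OpenAI-compatible gateway endpoint, many models behind it.
# The base URL, API key, and default model alias below are placeholders.
client = OpenAI(base_url="http://llm-gateway.internal:4000/v1", api_key="sk-internal")

def complete(prompt: str, model: str = "qwen3.6-35b-a3b") -> str:
    """Send a prompt through the gateway to whichever model the router chose."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# The router supplies the `model` argument; the gateway handles credentials,
# retries, and provider differences behind that single interface.
```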
In-House Hosting Is Finally Practical
For most of the last two years, "self-host the model" was a romantic idea that died on contact with reality. Hardware was scarce. Serving stacks were immature. The quality gap was real.
That has changed. With models like Qwen3.6-35B-A3B and gemma-4-31B-it, a serious team can stand up production-grade inference on hardware that is no longer exotic. Inference servers like vLLM, SGLang, and TensorRT-LLM have matured. Quantization and KV-cache techniques are well understood. The operational pattern looks much closer to running a database cluster than running a research lab.
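To ground that, here is what a background batch workload can look like against a self-hosted open model with vLLM. The Hugging Face repo name is a placeholder for whichever open-weights model you actually deploy; hardware sizing and quantization are assumed to be handled separately.

```python
from vllm import LLM, SamplingParams

# Offline batch inference on a self-hosted open model. The repo name is a
# placeholder -- point it at the open-weights model you actually deploy.
llm = LLM(model="your-org/open-model-30b-instruct")

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = [
    "Summarize this support ticket: ...",
    "Classify the following log line: ...",
]

# One output per prompt -- the daily-grade work that should not be paying
# frontier prices.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```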
What you get back in return:
- Data never leaves your boundary.
- Predictable cost — you have already paid for the GPUs; per-call cost is amortization plus electricity.
- Customization — fine-tune on your codebase, your tickets, your domain language.
- No rate limits, no provider outages, no model-deprecation surprises.
That is a fundamentally different shape of risk from "we depend on one provider's API for a critical workload."
What This Doesn't Mean
A few honest caveats, because the open-source ecosystem is not pretending to win on every dimension yet:
- The hardest reasoning, the deepest agentic loops, the longest contexts, the most demanding multimodal — frontier models still lead, and the gap on those workloads is real today.
- DeepSeek 4.0 and its peers are at the level the market practically needs, but they are not yet at frontier capability on every benchmark. That is a feature, not a bug — most workloads do not need frontier capability.
- Open weights are not the same as an open process. You still need governance, observability, evaluation harnesses, and a security model.
This is not a story of open-source replacing frontier. It is a story of open-source being good enough for the 90%, and frontier being worth the premium for the 10%, and the router being the part of the system that makes that economic split actually work in production.
A Practical View of the Next Few Years
The shape I expect to see:
- Frontier models keep climbing the intelligence ceiling and getting more expensive at the top, while entry-tier prices stay roughly where they are.
- Open-source models keep eating the middle and lower workloads with quality that keeps creeping upward and cost that keeps trending toward zero.
- Specialized routers and gateways become standard infrastructure — the way load balancers and API gateways became standard a decade ago.
- More intelligence at the top of the stack means more verticals become economically viable for AI — legal, healthcare, supply chain, manufacturing — but the workloads inside those verticals will still be served mostly by the smaller, cheaper models.
The organizations that win the next phase are not the ones running the most expensive model. They are the ones routing the most workload to the cheapest model that is still good enough — and reserving the expensive model for exactly the work that justifies it.
The Move If You Lead an Engineering Org
Concrete next steps:
- Audit your AI spend by task type. You will be surprised how much of your bill is going to frontier models for tasks a 30B open model would handle just as well.
- Pilot a self-hosted open model on one or two clearly bounded workloads — IDE completions, internal RAG, ticket classification — and measure the cost delta with real traffic.
- Build the router before you need it. A thin classification and routing layer in front of your model calls is the highest-ROI piece of AI infrastructure you can ship this quarter.
- Track quality the way you track cost. Without an eval harness, you cannot defend any routing decision.
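On the last point, the harness does not need to be elaborate to be useful. A minimal sketch, assuming a small labeled case set per task and a `call_model` hook into your gateway (both placeholders):

```python
# Minimal routing eval: run the same labeled cases through the cheap tier and
# the frontier tier, and compare pass rates. `call_model`, the case set, and
# the model names are placeholders for your own setup.

def exact_match(expected: str, actual: str) -> bool:
    """Toy scorer -- swap in whatever metric actually fits the task."""
    return expected.strip().lower() == actual.strip().lower()

def evaluate(call_model, cases, models):
    """cases: list of (prompt, expected) pairs; models: list of model names."""
    results = {}
    for model in models:
        passed = sum(
            exact_match(expected, call_model(model, prompt))
            for prompt, expected in cases
        )
        results[model] = passed / len(cases)
    return results

# If the open model lands within a point or two of the frontier model on your
# own cases, that traffic belongs on the cheap tier.
# evaluate(call_model, ticket_classification_cases,
#          ["qwen3.6-35b-a3b", "frontier-reasoning-xl"])
```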
We are doing this work across our engineering organization at GeekyAnts and with our clients. The economics are not subtle. The teams that build this pattern early will compound the savings. The teams that do not will be paying frontier prices for daily-grade work for a long time.
The model layer is no longer a single choice. It is a portfolio. Treat it like one.
*If you want to compare notes on routing strategy or how we are running this internally at GeekyAnts, reach out at pratik@geekyants.com or geekyants.com.*