The 2026 LLM Stack: How I Actually Choose Models for Production

Every week someone asks me which LLM they should use. The honest answer is the one nobody wants to hear: it depends, and anyone who tells you otherwise is selling you something. I run AI in regulated environments — healthcare at MediVox, identity and security work at Skjld Labs, the multi-LLM platform at Jaydus.AI. The choice of model is never just about benchmarks. So here's how I actually think about it in early 2026.

The families that matter right now

The field has consolidated into roughly five serious players. Each is genuinely good at something, and pretending they're interchangeable is how you end up with a brittle production system.

OpenAI: GPT-5 family

GPT-5.5 landed in late April with the usual OpenAI cadence — incremental on intelligence, meaningful on token efficiency and latency. It is the most boring choice in the best sense: well-behaved on structured outputs, predictable function calling, an ecosystem of SDKs that just works. If your team is mid-level and you want fewer surprises, this is your default. The unified multimodal pipeline (text, image, audio, video in one architecture) is genuinely useful when you care about pipelines rather than benchmarks.
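Structured outputs are the concrete win here. A minimal sketch, assuming the official openai Python SDK; the gpt-5.5 model id is taken from the paragraph above and should be swapped for whatever pinned version you actually ship:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative model id; pin the exact dated version you have evals for.
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Extract the total from: Invoice #88, 1 240,00 EUR"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"total_eur": {"type": "number"}},
                "required": ["total_eur"],
                "additionalProperties": False,
            },
        },
    },
)

print(resp.choices[0].message.content)  # JSON that conforms to the schema
```

With strict schemas the model either returns conforming JSON or refuses; there is no "parse the prose and hope" step, which is most of what "well-behaved" means in production.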

Anthropic: Claude Sonnet 4.6 and Opus

Sonnet 4.6 (released in February 2026) is, in my hands, the best model for agentic coding and computer use available to anyone with a credit card. By most published evals it is sitting near the top of SWE-bench Verified, and the gap between Sonnet and Opus has narrowed enough that I rarely reach for Opus unless I genuinely need the extra reasoning headroom. The 1M context window at standard pricing changed how I architect retrieval — for some Eir Tec workflows we just stuff the whole knowledge base in.
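Here is what "stuff the whole knowledge base in" looks like, as a minimal sketch assuming the anthropic Python SDK; the model id and the kb/ directory are illustrative:

```python
import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Naive whole-knowledge-base stuffing: concatenate every document and send
# it as context. Only sane with a 1M-token window at standard pricing.
kb = "\n\n".join(p.read_text() for p in sorted(pathlib.Path("kb").glob("*.md")))

msg = client.messages.create(
    model="claude-sonnet-4-6",  # illustrative id; pin the version you tested
    max_tokens=1024,
    system=f"Answer strictly from this knowledge base:\n\n{kb}",
    messages=[{"role": "user", "content": "Which workflows touch patient data?"}],
)

print(msg.content[0].text)
```

It is crude, but when the whole corpus fits in context you skip an entire retrieval pipeline, its eval suite, and its failure modes.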

Google: Gemini 2.5 Pro

Gemini is the model nobody talks about and everybody quietly uses for the long-context jobs. 1M tokens (with 2M flagged as coming), strong native multimodality including audio, and the cheapest serious frontier pricing I've seen at $1.25 per million input tokens as of early 2026. When I'm doing video transcription pipelines or analysing entire repositories at MediVox, Gemini gets the call. Tool use has matured — the native MCP support in the API is a real ergonomic win.
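The long-context multimodal combination is the point. A sketch assuming the google-genai Python SDK; the file name and prompt are illustrative:

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload a large artifact once, then reason over it in a single call.
recording = client.files.upload(file="standup-2026-03-02.mp3")

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["Transcribe this meeting and list every action item with an owner.",
              recording],
)

print(resp.text)
```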

Meta: Llama 4

This is the only family on the list you can run on your own metal, which makes it the only family that matters if you have data that legitimately cannot leave the building. Scout fits on a single GPU, Maverick is the workhorse, and Behemoth is there if you need frontier capability in open weights. The 10M context window is genuinely new in the open-weights world. Self-hosted, you can land in the $0.20–0.50 per million token range — an order of magnitude under the closed APIs.
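The practical trick is that self-hosted does not mean a bespoke API: serve Llama behind an OpenAI-compatible endpoint and your application code stays vendor-neutral. A sketch assuming vLLM; the model id and port are illustrative:

```python
# Server side (a single GPU for Scout), e.g.:
#   vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct
from openai import OpenAI

# Same client library as the hosted APIs, pointed at your own metal,
# so prompts and data never leave the building.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarise this discharge note: ..."}],
)

print(resp.choices[0].message.content)
```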

Mistral

The European choice, and I mean that practically, not patriotically. Apache 2.0 base models, EU-incorporated company, strong code generation. For Norwegian and German customers who read GDPR articles before they read benchmarks, Mistral plus an EU-region deployment removes an entire category of legal review.

The hosting question is the real question

Every architect skips this and every regulator asks about it. Here is the 2026 reality:

  • Azure OpenAI — still the path of least resistance for enterprises already on Microsoft. As of April 2026, OpenAI's exclusivity with Azure ended; the same models are now available on AWS Bedrock too.
  • AWS Bedrock — Claude on Bedrock with inference pinned to an AWS EU region is the cleanest GDPR story I can write today (region-pinning sketch after this list). Anthropic's own EU residency on Microsoft Foundry is still listed as coming in 2026; until then, Foundry routes to Anthropic-operated infrastructure regardless of which Azure region you pick. Read the fine print.
  • AWS European Sovereign Cloud — went live in January: German-incorporated, physically separate. For some of our healthcare customers this is the only configuration their compliance team will sign.
  • On-prem Llama — for the workloads where the data genuinely cannot move. Expensive in engineering time, cheap per token, and the only honest answer in some regulated contexts.
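Here is the region-pinning sketch mentioned above, assuming boto3 and the Bedrock Converse API; the model id is an assumption, so check what is actually offered in your region before shipping:

```python
import boto3

# Pin the runtime client to an EU region so inference stays in-region.
bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

resp = bedrock.converse(
    modelId="eu.anthropic.claude-sonnet-4-6",  # assumed id, verify in the console
    messages=[{"role": "user", "content": [{"text": "Classify this document: ..."}]}],
)

print(resp["output"]["message"]["content"][0]["text"])
```

The point is not the three lines of code; it is that the region becomes an explicit, reviewable parameter your compliance team can audit.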

The axes that actually decide

Benchmarks are vibes. The axes I actually score on, in roughly this order:

  1. Data residency and contractual posture. If the answer is wrong here, nothing else matters.
  2. Tool use reliability. Can the model call your functions correctly under load, with weird inputs, ten times in a row? Most of production AI is this.
  3. Latency and cost at your token mix. Per-token price is misleading: a model that uses fewer tokens to finish a task can be cheaper at $15/M than a competitor at $3/M (worked example after this list).
  4. Context window — useful, not theoretical. 1M tokens is real. Quality at 800k tokens is what you should test.
  5. Eval stability across versions. Pinning a model version and having a regression suite matters more than chasing the leaderboard.
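The worked example for item 3, with hypothetical prices and token counts:

```python
def job_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one job; prices are $ per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# "Expensive" model that finishes in one short pass:
print(job_cost(8_000, 1_000, in_price=15.0, out_price=75.0))          # 0.195

# "Cheap" model that needs three attempts and verbose reasoning tokens:
print(job_cost(3 * 8_000, 3 * 6_000, in_price=3.0, out_price=15.0))   # 0.342
```

The $3/M model costs roughly 75% more on this job. Measure your own token mix; the sticker price tells you almost nothing.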

The model is a dependency. Treat it like one. Pin the version, write evals, monitor drift, and never let a vendor's release notes decide your production behaviour for you.
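A minimal version of that discipline in Python; the model id and the call_model helper are hypothetical placeholders for your own thin client:

```python
PINNED_MODEL = "gpt-5.5-2026-04-30"  # an exact dated version, never "latest"

# A fixed regression suite: tiny, domain-specific, run on every candidate bump.
EVAL_SUITE = [
    {"prompt": "Return only the ICD-10 code for type 2 diabetes.", "expect": "E11"},
    {"prompt": "Extract the order id from: ref ORD-20931 shipped.", "expect": "ORD-20931"},
]

def passes(model_id: str, call_model) -> bool:
    """call_model(model_id, prompt) -> str is your own client wrapper."""
    return all(case["expect"] in call_model(model_id, case["prompt"])
               for case in EVAL_SUITE)

# In CI: only update PINNED_MODEL once passes(candidate, call_model) is True.
```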

My current default stack

For what it's worth, here is roughly what I reach for:

  • Agentic coding and computer use — Claude Sonnet 4.6 on Bedrock.
  • Long-context retrieval and multimodal — Gemini 2.5 Pro on Vertex.
  • General product features, structured outputs, function-call-heavy backends — GPT-5.5 on Azure or Bedrock.
  • EU-residency-required workloads — Mistral, or Llama 4 on our own infrastructure.
  • Anything where data must not leave the building — Llama 4 self-hosted, no exceptions.

This is exactly why Jaydus.AI exists. We currently route across 14+ models because no single vendor wins on every axis, and locking a regulated product to one provider is a risk I am not willing to take on behalf of a customer. The platform abstracts the choice so the application doesn't care; the architect cares.
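A toy version of the routing idea, to make the shape concrete; the registry contents are illustrative, not the actual Jaydus.AI routing table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model_id: str
    host: str           # "bedrock", "vertex", "azure", "self-hosted"
    eu_resident: bool

REGISTRY = {
    "agentic-coding": Route("claude-sonnet-4-6", "bedrock", eu_resident=True),
    "long-context":   Route("gemini-2.5-pro", "vertex", eu_resident=False),
    "structured":     Route("gpt-5.5", "azure", eu_resident=False),
    "on-prem-only":   Route("llama-4-scout", "self-hosted", eu_resident=True),
}

def route(capability: str, require_eu: bool = False) -> Route:
    """The application states its constraints; the router picks the backend."""
    r = REGISTRY[capability]
    if require_eu and not r.eu_resident:
        raise LookupError(f"no EU-resident backend registered for {capability!r}")
    return r

print(route("agentic-coding", require_eu=True))
```

The application asks for a capability and a constraint; which vendor answers is an operational detail you can change without touching product code.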

The field is consolidating, the answer still depends

By the end of 2026 I expect the gap between the top three closed models to be small enough that workload fit, residency, and cost will dominate raw capability for most production decisions. Open-weights will keep eating the regulated-data segment. The marketing will keep getting louder.

If someone tries to sell you a single-model strategy in 2026, ask them what they do when that model has an outage, when its pricing changes, when a customer demands EU residency, or when a new release regresses your eval suite. If their answer is a shrug, walk away.

Håkon Berntsen

About the Author

Håkon Berntsen is a Systems Architect at MediVox AS with over 20 years of experience in IT development, systems architecture and artificial intelligence. He is also Chairman of Open Info and an expert in AI agents and autonomous systems.