From demos to production: why agentic AI hits a wall in regulated environments

The agent demos that get reposted are almost always the same shape. Someone asks a model to book a flight, file a Jira ticket, or refactor a repo, the screen-recording speeds up the boring parts, and the audience claps. I clap too. Then I close the tab and go back to the kind of agent work that pays the bills, which mostly involves convincing a clinical system that the agent did not in fact hallucinate a referral letter at 02:14 on a Tuesday.

This is the gap nobody on the demo timeline wants to talk about: the distance between an agent that can do the thing once and an agent that can do the thing ten thousand times in a regulated environment without anyone losing their license. It is not a small gap. In healthcare and financial services it is the entire job.

Where the stack actually is in 2026

Let us be honest about the inputs, because the marketing layer is louder than usual right now.

  • MCP has stopped being Anthropic's protocol and become industry plumbing. After the donation to the Linux Foundation late last year and co-sponsorship from OpenAI, Google, and Microsoft, MCP is now the default way to expose tools and context to an agent (a minimal sketch of what that looks like follows this list). Adoption numbers vary depending on who is selling them, but it is no longer rare for an enterprise team to be running at least one MCP-backed agent in production. The 2026 roadmap is squarely about the unsexy bits: auth, gateways, audit, governance.
  • Agent SDKs have consolidated. LangGraph still has the most production mileage and the best story for durable execution and checkpointing. The OpenAI Agents SDK has matured into something resembling the harness behind Codex, with sandboxing and tracing baked in. Anthropic's Claude Agent SDK (the rebrand of the old Claude Code SDK) is the cleanest fit if you are already on Claude and want hooks, subagents, and MCP without writing your own scaffolding. CrewAI and AutoGen are still around and still fine for what they are.
  • Browser-use and computer-use work, with caveats. The best open-source browser agents post WebVoyager numbers in the high eighties. That is impressive and also nowhere near production reliability for unattended multi-step work. The honest framing is: great for supervised workflows, dangerous for autonomous ones, and the gap closes slowly.
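For the curious, this is roughly what exposing a tool over MCP looks like with the official Python SDK's FastMCP helper. A minimal sketch: the server name, the tool, and the toy lookup are illustrative, not anyone's production code.

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("referrals")  # hypothetical server name

    @mcp.tool()
    def get_referral_status(referral_id: str) -> str:
        """Return the current status of a referral."""
        # Stand-in store; a real server would query the record-of-truth system.
        fake_store = {"ref-001": "pending review"}
        return fake_store.get(referral_id, "unknown")

    if __name__ == "__main__":
        mcp.run()  # serves the tool over stdio by default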

So the tooling is real. That is not the same as saying it is ready for the place I work.

Why agents fail harder than chatbots

A chatbot that hallucinates writes a wrong sentence. An agent that hallucinates calls a tool. The blast radius is different by orders of magnitude, and the failure modes are too:

  • Non-determinism at every step. Same prompt, same tools, different trajectory. Fine in a demo, painful in a regulated audit.
  • Side effects. Tool calls write to systems. A retry is not free, and an idempotency story is not optional; a sketch of one follows this list.
  • Partial failures. Step three of seven succeeds, step four times out, step five runs anyway because the planner did not notice. Welcome to distributed systems with a stochastic scheduler.
  • Hallucinated tool calls. The model invents an argument, a parameter, sometimes a whole tool. Tight schemas help. They do not eliminate the class.
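The idempotency point deserves a concrete shape. A minimal sketch, assuming a hash-based key and an in-memory result store; the function names are mine, and in production the store has to be durable.

    import hashlib
    import json

    _completed: dict[str, dict] = {}  # stand-in; production needs durable storage

    def idempotency_key(tool_name: str, args: dict) -> str:
        """Derive a stable key from the tool name and its arguments."""
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call_with_idempotency(tool_name: str, args: dict, tool_fn) -> dict:
        """Run a side-effecting tool at most once per key; retries return the recorded result."""
        key = idempotency_key(tool_name, args)
        if key in _completed:
            return _completed[key]  # the retry is now free
        result = tool_fn(**args)    # the one real side effect
        _completed[key] = result
        return result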

None of this is fatal. All of it is real, and most teams ship before they have answers for any of it.

What actually works

The patterns I keep coming back to, across Eir Tec's clinical agents, MediVox transcription pipelines, and the more experimental work at Skjld Labs, are boring on purpose:

Tight tool schemas, allow-listed actions

Every tool gets a strict schema. Every action that touches a record-of-truth system goes through an allow-list. If the agent wants to do something not on the list, it asks. This sounds restrictive because it is. It is also the thing that lets the rest of the system breathe.
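A minimal sketch of what that looks like, assuming Pydantic for the schemas; the tool names, the patient-ID pattern, and the escalation path are placeholders, not anyone's real allow-list.

    from pydantic import BaseModel, Field, ValidationError

    class DraftReferralArgs(BaseModel):
        patient_id: str = Field(pattern=r"^[A-Z0-9]{8}$")
        specialty: str
        urgency: str = Field(pattern=r"^(routine|urgent)$")

    # Every tool that touches a record-of-truth system is listed here, with its schema.
    ALLOW_LIST: dict[str, type[BaseModel]] = {"draft_referral": DraftReferralArgs}

    def dispatch(tool_name: str, raw_args: dict) -> dict:
        if tool_name not in ALLOW_LIST:
            return {"status": "escalated"}  # not on the list: the agent asks a human
        try:
            args = ALLOW_LIST[tool_name](**raw_args)  # invented parameters get rejected here
        except ValidationError as err:
            return {"status": "rejected", "detail": str(err)}
        return {"status": "ok", "args": args.model_dump()}  # hand off to the real tool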

Deterministic outer loops

The agent is the inner loop. The outer loop is a plain scheduler with retries, timeouts, and explicit state. Most of what people call multi-agent is just an outer loop with better marketing. Treat the LLM as the smart core inside a dumb shell, not the other way around.
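A sketch of the shape I mean, with the agent step stubbed out. The step budget, retry policy, and backoff are placeholder numbers, and run_agent_step stands in for whatever SDK call you actually make.

    import time

    def run_agent_step(state: dict, timeout: float) -> dict:
        """Stand-in for the model-plus-tools call behind your agent SDK of choice."""
        raise NotImplementedError

    def outer_loop(task: dict, max_steps: int = 7, max_retries: int = 2,
                   step_timeout_s: float = 60.0) -> dict:
        state = {"task": task, "history": [], "status": "running"}  # explicit state
        for _step in range(max_steps):
            for attempt in range(max_retries + 1):
                try:
                    result = run_agent_step(state, timeout=step_timeout_s)  # the inner loop
                    break
                except TimeoutError:
                    if attempt == max_retries:
                        state["status"] = "failed"
                        return state          # fail loudly, never silently
                    time.sleep(2 ** attempt)  # plain, boring backoff
            state["history"].append(result)   # checkpoint after every step
            if result.get("done"):
                state["status"] = "done"
                return state
        state["status"] = "exhausted"
        return state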

Observability before cleverness

Log every tool call, every model call, every prompt, every output. Replayability is not a nice-to-have; it is the audit trail. In a clinical context, an agent that adjusts a referral letter has to leave the same evidence a clinician would have left. If you cannot reconstruct the exact decision path months later, the agent does not ship. Roughly nine in ten teams now run some form of agent observability; that is the bar.
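One way to capture that, sketched with a JSONL sink; the field names and the file path are assumptions, the principle is one append-only record per model or tool call.

    import json
    import time
    import uuid

    TRAJECTORY_LOG = "trajectories.jsonl"  # placeholder sink; use whatever your audit store is

    def log_event(run_id: str, kind: str, payload: dict) -> None:
        """Append one event (prompt, model output, or tool call) to the trajectory log."""
        record = {
            "run_id": run_id,
            "event_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "kind": kind,        # "prompt" | "model_output" | "tool_call"
            "payload": payload,  # full inputs and outputs, never a summary
        }
        with open(TRAJECTORY_LOG, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Log before and after every call the agent makes, under one run_id per trajectory.
    run_id = str(uuid.uuid4())
    log_event(run_id, "prompt", {"model": "model-version-goes-here", "prompt": "Draft a referral ..."})
    log_event(run_id, "tool_call", {"tool": "draft_referral", "args": {"patient_id": "A1B2C3D4"}})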

Evaluators in the loop, humans at the checkpoints

Cheap evaluator models score every trajectory in production, not just in CI. High-stakes branches route to a human checkpoint, every time, no exceptions. Output-only evals are misleading; trajectory evals catch failures that final-answer scoring misses by a wide margin.
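In code, the routing rule is small. The evaluator call is stubbed, and the high-stakes tool names and the 0.8 threshold are placeholders, not a recommendation.

    HIGH_STAKES_TOOLS = {"draft_referral", "submit_claim"}  # placeholder names

    def cheap_evaluator_score(trajectory: list[dict]) -> float:
        """Stand-in for a small evaluator model scoring the whole trajectory, not just the final answer."""
        raise NotImplementedError

    def route(trajectory: list[dict]) -> str:
        touched_high_stakes = any(
            event["kind"] == "tool_call" and event["payload"]["tool"] in HIGH_STAKES_TOOLS
            for event in trajectory
        )
        if touched_high_stakes:
            return "human_checkpoint"  # every time, no exceptions
        score = cheap_evaluator_score(trajectory)
        return "auto_approve" if score >= 0.8 else "human_checkpoint"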

A sketch of what an honest agent system looks like

Strip the diagrams down and you get something like this:

  • An outer scheduler in plain code, deterministic, durable, with checkpoints.
  • An inner agent with a small toolbox and a tight system prompt.
  • An explicit tool boundary, ideally over MCP, with allow-lists and per-tool auth.
  • An observability layer that captures full trajectories, prompts, and outputs.
  • An eval pipeline running offline on golden sets and online on a sample of live traffic.
  • Human-in-the-loop checkpoints for anything that touches a patient, a payment, or a regulator.

That is the shape. None of it is novel. All of it is what separates an agent that gets a press release from an agent that gets renewed for another year.

The compliance angle is not optional

In a clinical workflow, every action needs an audit trail with the same standard a human clinician would meet. That is not a tooling problem, it is a design constraint. If the agent edits a referral, there must be a reconstructable record of the inputs, the reasoning, the model version, the tool calls, and the human who signed off. Anything less and you do not have an agent, you have an unlicensed clinician with a great vocabulary.
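As a rough shape, the record that has to exist for every agent action looks something like this; the field names are mine, the fields themselves follow the list above.

    from dataclasses import dataclass, field

    @dataclass
    class AuditRecord:
        run_id: str
        model_version: str                  # the exact model, not "latest"
        inputs: dict                        # everything the agent saw
        reasoning: str                      # the trajectory, or a pointer to it
        tool_calls: list[dict] = field(default_factory=list)
        approved_by: str | None = None      # the human who signed off
        approved_at: float | None = None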

Agents without evals are just hopes. Agents without audit trails are just liabilities.

The closing bit

Agentic AI is real. The frameworks are real. MCP as a substrate is real. Browser-use and computer-use are real, with the caveat that real and reliable are different words.

What is also real is that most teams are shipping agents before they have answers for non-determinism, side effects, partial failures, and audit. In a consumer product that is a learning experience. In healthcare or financial services it is a headline.

Build the boring layer first. The agent is the easy part.

Håkon Berntsen

About the Author

Håkon Berntsen is a Systems Architect at MediVox AS with over 20 years of experience in IT development, systems architecture and artificial intelligence. He is also Chairman of Open Info and an expert in AI agents and autonomous systems.