Two years ago, building live clinical transcription for Norwegian hospitals was a research project. Today it's a product, deployed, billed for, and used in actual consultations. Between MediVox here in Norway and Eir Tec in the UK, we've shipped enough of these pipelines to have opinions. Most of them are about the parts nobody puts in the demo video.
This is the unglamorous version.
Why clinical transcription is finally tractable
The honest answer: we got lucky on three fronts at once.
Whisper-class models opened the door. By the time we started, OpenAI's Whisper, distil-whisper, NVIDIA's Parakeet family, and the closed-source contenders (Speechmatics, AssemblyAI) had all converged on something usable. Streaming variants stopped being toys. For Norwegian specifically, NB-Whisper from the National Library's AI Lab gave us a real starting point — fine-tuned on Bokmål and Nynorsk, with reported improvements over base Whisper Large-v3 on Norwegian benchmarks. Without that head start we'd have spent a year just collecting audio.
GPU economics tipped over. Running large ASR plus an LLM summariser per consultation used to be a budget conversation. With newer inference stacks, batching, and quantised variants of distil-whisper, the per-minute cost dropped low enough that hospitals stopped flinching at the quote.
On-prem became a real option. Some clinics will simply not let audio leave their walls. Being able to run the stack on a single GPU box on-site, or in a Norsk helsenett-connected tenancy, decides whether you get the contract.
Norwegian is the hard problem
English clinical ASR is, frankly, solved enough. Norwegian is a different sport.
- Low-resource by global standards. Less labelled medical audio, fewer pre-trained checkpoints, smaller eval sets.
- Dialects. A GP in Bergen, a consultant in Tromsø, and a junior doctor from Stavanger do not sound like the same language to a model trained on Oslo Bokmål.
- Bokmål and Nynorsk both matter. Refusing one is not commercially viable.
- Code-switching English drug names. Norwegian clinicians dictate in Norwegian and then say amoxicillin or metoprolol in something between English and Norwegianised pronunciation. Models trained on either language alone get this wrong.
- Domain vocabulary. ICD-10 codes spoken aloud, abbreviations, dosage units, anatomical Latin. None of this is in the FLEURS dataset.
We invested in custom vocabulary biasing, a medical lexicon hot-word layer, and a per-clinic adaptation pass. None of that is glamorous. All of it moves the needle more than swapping the base model.
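The simplest version of that lexicon layer is a post-ASR correction pass that snaps near-miss tokens to known medical terms. A minimal sketch, assuming a hypothetical per-clinic lexicon (the real one is far larger and the matching is smarter than edit distance):

```python
from difflib import SequenceMatcher

# Hypothetical mini-lexicon for illustration; the production list is
# per-clinic and built from the medication registry plus local terms.
MEDICAL_LEXICON = ["metoprolol", "amoxicillin", "pneumothorax", "bilateral"]

def snap_to_lexicon(token: str, threshold: float = 0.82) -> str:
    """Replace an ASR token with the closest lexicon term if it is
    similar enough; otherwise keep the token unchanged."""
    best_term, best_score = token, 0.0
    for term in MEDICAL_LEXICON:
        score = SequenceMatcher(None, token.lower(), term).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else token

def correct_transcript(text: str) -> str:
    # Token-by-token pass; good enough to illustrate the idea.
    return " ".join(snap_to_lexicon(tok) for tok in text.split())
```

The threshold matters: too low and you "correct" ordinary Norwegian words into drug names, which is its own safety problem. A true hot-word layer biases the decoder during recognition rather than patching afterwards, but the post-hoc pass is a useful safety net either way.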
The architectural shape that worked
After a few rewrites we settled on a pipeline that looks roughly like this:
- Streaming ASR layer producing partial and final hypotheses, with a medical hot-word bias.
- Diarization, so we can distinguish clinician from patient. We do this online with a lightweight model and re-segment offline at session end.
- An LLM step that turns the cleaned transcript into structured outputs: SOAP notes, referral letters, discharge summaries, patient-friendly recaps. Each output type has its own prompt, its own eval set, its own owner.
- A clinician edit pass in the UI. The clinician is always the final author. Nothing is auto-filed.
- An audit trail of every generation, every edit, every model version, retained per Norwegian healthcare retention rules.
The key design choice: the LLM never sees raw audio, only transcripts. The ASR layer never sees the structured journal entry. Boundaries make incidents debuggable.
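In code, that boundary is just a type: the summarisation step accepts finalised transcript segments and nothing else. A simplified sketch (the dataclass fields and the stubbed summariser are illustrative, not our actual interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptSegment:
    speaker: str   # "clinician" or "patient", from the diarization layer
    text: str
    final: bool    # partial hypotheses never cross the boundary

@dataclass
class StructuredNote:
    note_type: str              # "soap", "referral", "discharge", ...
    body: str
    model_version: str          # recorded for the audit trail
    edits: list = field(default_factory=list)  # clinician edit history

def summarise(segments: list, note_type: str, model_version: str) -> StructuredNote:
    """The LLM boundary: takes only finalised transcript text, never
    audio. The real call goes to a prompted LLM; stubbed here."""
    transcript = " ".join(s.text for s in segments if s.final)
    return StructuredNote(note_type=note_type, body=transcript,
                          model_version=model_version)
```

Because audio can never reach `summarise`, an incident in the summariser is automatically an incident about text, which narrows debugging enormously.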
Concrete failure modes
Things that bit us, in roughly the order they hurt:
- Hallucinated medication doses. The LLM, helpful to a fault, would fill in plausible-but-wrong dosages when the clinician trailed off. Fix: explicit refusal patterns in the prompt, plus a post-generation validator that flags any drug name without an explicit dose in the transcript.
- Drift on long sessions. Twenty-minute consultations produced summaries weighted toward the last five minutes. Fix: chunked summarisation with a running structured state, not one giant prompt.
- Mis-handled patient identifiers. Fødselsnummer spoken aloud, partially captured, embedded into a summary. Fix: a redaction pass before the LLM ever sees the transcript, with structured fields populated from the EHR rather than the audio.
- Latency spikes mid-consultation. A cold GPU, a noisy neighbour, a model server restart. Clinicians notice within two seconds. Fix: warm pools, hard timeouts, and a graceful local fallback that keeps capturing audio even when the cloud leg blips.
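The dose validator from the first fix is worth sketching, because it shows the pattern we reuse everywhere: never trust the LLM's output, cross-check it against the transcript. This is a toy version with a hypothetical drug list and a crude dose regex; the real one resolves drug names against a medication registry:

```python
import re

# Hypothetical drug list for illustration; in production this comes
# from a national medication registry lookup.
KNOWN_DRUGS = {"metoprolol", "amoxicillin", "paracetamol"}

DOSE_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:mg|g|ml|mikrogram)\b", re.IGNORECASE
)

def flag_unsupported_doses(summary: str, transcript: str) -> list:
    """Flag drugs whose summary sentence states a dose that has no
    explicit dose next to the same drug anywhere in the transcript."""
    flags = []
    for drug in KNOWN_DRUGS:
        for m in re.finditer(rf"\b{drug}\b[^.]*", summary, re.IGNORECASE):
            if DOSE_PATTERN.search(m.group()):  # summary asserts a dose
                supported = any(
                    DOSE_PATTERN.search(t.group())
                    for t in re.finditer(rf"\b{drug}\b[^.]*", transcript,
                                         re.IGNORECASE)
                )
                if not supported:
                    flags.append(drug)
    return flags
```

Flagged items are surfaced in the clinician edit pass rather than silently removed; the validator's job is to make the hallucination visible, not to guess the right dose.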
Privacy decisions that are actually sales blockers
These are not nice-to-haves. Without them, procurement closes the door:
- Single-tenant deployment per clinic, on Azure EU regions or on-prem in a Norsk helsenett context.
- Anonymised retention by default. Raw audio deleted on a short clock unless the clinic explicitly opts in.
- No training on clinical data without per-clinic, per-purpose consent. Ever.
- A signed DPA that covers special-category health data under GDPR Article 9, not just generic personal data.
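Retention only works as a sales answer if it is enforced mechanically, not by policy document. A minimal sketch of the shape, with hypothetical values (the real numbers come from each clinic's signed DPA):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-clinic policy; actual values are set in the DPA.
RETENTION = {
    "raw_audio_days": 7,        # deleted on a short clock by default
    "transcript_days": 3650,    # journal retention rules
    "opt_in_audio_training": False,
}

def audio_deletion_due(captured_at: datetime, policy: dict) -> datetime:
    """The moment a consultation's raw audio must be gone,
    absent an explicit per-clinic opt-in."""
    return captured_at + timedelta(days=policy["raw_audio_days"])
```

A scheduled job compares `audio_deletion_due` against the clock and hard-deletes; the audit trail records the deletion, not the audio.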
Evals are the unsexy hero
Every model upgrade — ASR or LLM — goes through a clinician-rated eval set before it gets promoted. Real consultations, anonymised, scored on factuality, completeness, and a "would you sign this?" question.
The day we skipped that, on a "small" prompt change, we shipped a regression that softened recommended follow-ups. A clinician caught it the same afternoon. We were lucky. Now there is no path to production that bypasses the eval.
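The gate itself is boring code, which is the point: it cannot be argued with on a deadline. A simplified sketch, with hypothetical thresholds (ours are tuned per output type):

```python
# Hypothetical promotion thresholds; in practice these are per output
# type (SOAP note, referral, discharge summary) and revisited quarterly.
THRESHOLDS = {"factuality": 4.5, "completeness": 4.2, "would_sign": 0.95}

def gate(candidate_scores: dict, baseline_scores: dict) -> bool:
    """Promote a candidate model only if it clears every absolute
    threshold AND does not regress against current production."""
    for axis, minimum in THRESHOLDS.items():
        if candidate_scores[axis] < minimum:
            return False
        if candidate_scores[axis] < baseline_scores[axis]:
            return False
    return True
```

The double condition matters: a model can beat the thresholds while still being worse than what clinicians have today, and that regression is exactly what the softened-follow-ups incident looked like.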
The model is not the product. The eval set is the product. The model is replaceable.
The outcome that actually matters
We used to talk about WER lifts. Nobody buying this software cares about WER. What they care about is whether their consultants are still doing journal entries at 21:00 on a Tuesday.
The metric that closes deals is the one we hear in every renewal conversation: "I get my evenings back." That is the number. Everything in the stack — the ASR choice, the diarization, the LLM prompts, the eval gates, the Azure region — exists to defend that number.
Closing
ASR plus LLM in clinical workflows is not the hard problem anymore. The hard problem is operationalising it without breaking trust: handling Norwegian properly, refusing to hallucinate doses, keeping data inside the right border, and gating every change behind a clinician-rated eval.
Get those right and the technology disappears into the background, which is exactly where clinicians want it.