LLMs as translators, not oracles

The single design choice that separates reliable language-model systems from impressive demos — and how to evaluate it from outside.

Two jobs people give language models

There are fundamentally two roles an LLM can play in a serious system:

Oracle — the model is asked for an answer, a judgment, a decision. "Which technician should take this job?"
Translator — the model is asked to convert between representations. "Turn this sentence into this JSON schema."

Demos love oracles because oracles sound intelligent. Production systems should love translators, for one structural reason: a translation can be checked, while a judgment can only be trusted.

Why translation is checkable

When the model's job is to produce a typed structure — a rule with scope, condition and requirement; a disruption report with vehicle, location, affected orders — three layers of verification become available that an oracle never offers:

Schema validation. The output parses against a strict schema or it doesn't. Failures are caught at the gate, repaired or rejected — never silently absorbed downstream.
Human confirmation. A dispatcher can read "this rule applies to Gold-tier customers, requires response ≤ 4h" and confirm it matches what they meant. Nobody can meaningfully confirm an oracle's hidden reasoning.
Benchmarking. Translation has a right answer, so you can build an evaluation suite: N inputs with known correct structures, scored mechanically — schema validity rate, field-level accuracy, repair-loop recovery. You can rerun it before every model swap. Try designing the equivalent for "makes good dispatch decisions."
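The first and third layers can be sketched in a few lines of Python. This is a minimal illustration and not a description of any particular product. The rule fields (`scope`, `condition`, `max_response_hours`), the helper names `validate_rule` and `score`, and the sample data are all assumptions made for the example. A real system would more likely use a schema library, and the repair loop is not shown.

```python
import json
from typing import Optional

# Hypothetical schema for illustration: field name -> required Python type.
RULE_SCHEMA = {"scope": str, "condition": str, "max_response_hours": int}


def validate_rule(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse model output and check it against RULE_SCHEMA.

    Returns (rule, errors); rule is None whenever errors is non-empty.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, expected in RULE_SCHEMA.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif type(data[field]) is not expected:  # strict: a bool is not an int
            errors.append(f"wrong type for {field}")
    for field in sorted(data.keys() - RULE_SCHEMA.keys()):
        errors.append(f"unexpected field: {field}")
    return (None, errors) if errors else (data, [])


def score(outputs: list[str], expected: list[dict]) -> dict[str, float]:
    """Schema validity rate and field-level accuracy over a labeled suite."""
    if not expected or len(outputs) != len(expected):
        raise ValueError("need one output per non-empty expected list")
    valid = 0
    correct_fields = 0
    for raw, gold in zip(outputs, expected):
        rule, _ = validate_rule(raw)
        if rule is None:
            continue  # an invalid output scores zero on every field
        valid += 1
        correct_fields += sum(rule[f] == gold[f] for f in RULE_SCHEMA)
    return {
        "schema_validity": valid / len(expected),
        "field_accuracy": correct_fields / (len(expected) * len(RULE_SCHEMA)),
    }


# Two made-up model outputs for the same sentence; the second is malformed.
gold = [{"scope": "Gold-tier customers", "condition": "new ticket",
         "max_response_hours": 4}] * 2
outputs = [
    '{"scope": "Gold-tier customers", "condition": "new ticket", "max_response_hours": 4}',
    '{"scope": "Gold-tier customers", "max_response_hours": "4h"}',
]
print(score(outputs, gold))  # {'schema_validity': 0.5, 'field_accuracy': 0.5}
```

Because the scoring is mechanical, the same suite can be rerun unchanged against a candidate replacement model and the two result dictionaries compared directly.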

The division of labor that follows

If the LLM only translates, something else must decide. In operational systems that's typically a symbolic component — a constraint solver checking plans against rules — which brings determinism and provable explanations to exactly the place where stakes are highest. The pattern (the neuro-symbolic gate) is covered in depth in our explainer on neuro-symbolic AI.
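To make the division of labor concrete, here is a deliberately small sketch of a deterministic gate. It is a plain rule check standing in for a real constraint solver. The types `Rule`, `PlannedVisit` and `Verdict`, the function `check`, and the sample rule and orders are hypothetical names invented for this illustration. The point it shows is that the rejection message is assembled from the same comparison that produced the rejection, so the explanation cannot drift from the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    scope: str               # customer tier the rule applies to
    max_response_hours: int  # requirement: respond within this many hours


@dataclass(frozen=True)
class PlannedVisit:
    order_id: str
    customer_tier: str
    response_hours: int      # planned time until a technician responds


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    violations: tuple[str, ...]


def check(plan: list[PlannedVisit], rules: list[Rule]) -> Verdict:
    """Accept the plan only if no visit violates a rule that covers it."""
    violations = []
    for visit in plan:
        for rule in rules:
            if visit.customer_tier != rule.scope:
                continue  # rule does not cover this customer
            if visit.response_hours > rule.max_response_hours:
                # The explanation names the rule and the facts compared.
                violations.append(
                    f"{rule.rule_id}: order {visit.order_id} is "
                    f"{visit.customer_tier}, planned response "
                    f"{visit.response_hours}h > allowed "
                    f"{rule.max_response_hours}h"
                )
    return Verdict(accepted=not violations, violations=tuple(violations))


rules = [Rule("R-gold-4h", "Gold", 4)]
plan = [PlannedVisit("A17", "Gold", 6), PlannedVisit("B02", "Silver", 9)]
verdict = check(plan, rules)
print(verdict.accepted)       # False
print(verdict.violations[0])
# R-gold-4h: order A17 is Gold, planned response 6h > allowed 4h
```

In this arrangement the language model's only contribution would be producing the `Rule` and `PlannedVisit` structures from text; the same inputs always yield the same verdict.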

The slogan we use internally for Vera is blunt about it: the LLM never decides. This is not a limitation of the current model generation that we expect to relax later; it is what makes the rest of the system's guarantees possible at all.

How to evaluate a vendor's claim

Useful questions to ask anyone selling an "AI agent" for operations:

At the moment an action is committed, what component makes the accept/reject call — a model or a deterministic checker?
If a model's output is wrong, where is it caught, and what does the failure look like to a user?
Can they show you their extraction evaluation numbers — and what happens to those numbers when the underlying model is swapped?
When the system says no, can it name the exact rule and facts responsible — and is that explanation derived from the computation or generated alongside it?

Vendors building translator architectures answer these in one breath. Vendors building oracle architectures change the subject to model quality, which is precisely the thing nobody can guarantee.