Engineering · 8 min read

Building observable agent loops that teams actually trust

Why the difference between a demo and production is telemetry, budgets, and human-readable traces—and how we wire them from day one.

Alex Rivera
Principal Engineer, Novelty Lab
observability, agents, runtime, tracing

Most agent failures in the wild are not model failures—they are control-flow failures. The system took an action nobody can explain, or it burned through token budgets before a human noticed. Production-grade agents need the same discipline as production-grade services: clear boundaries, measurable steps, and evidence when something goes wrong.

Treat each tool call as a span

We attach structured metadata to every tool invocation: intent, inputs (redacted where needed), latency, and outcome. That single habit turns “the bot did something weird” into a replayable timeline your security and support teams can audit.

  • Correlate user sessions with agent runs using a stable trace id.
  • Persist refusal reasons when policy blocks an action—do not swallow them.
  • Emit budget events when retry counts or fan-out approach their configured limits.
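
As a rough illustration of the habits above, here is a minimal Python sketch of a span-per-tool-call wrapper. Every name in it (`AgentRun`, `Span`, `PolicyBlocked`, `SENSITIVE_KEYS`, the 80% warning ratio) is a hypothetical stand-in rather than our runtime's actual API, and a real system would ship spans to a tracing backend instead of keeping them in a list.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Hypothetical redaction list; a real deployment would load this from policy.
SENSITIVE_KEYS = {"password", "api_key", "ssn"}


class PolicyBlocked(Exception):
    """Hypothetical error a tool raises when policy refuses the action."""


@dataclass
class Span:
    trace_id: str
    tool: str
    intent: str
    inputs: dict[str, Any]          # stored after redaction
    latency_ms: float = 0.0
    outcome: str = "pending"        # "ok", "refused", or "error"
    refusal_reason: Optional[str] = None


@dataclass
class AgentRun:
    # One stable trace id per run, so a user session can be joined to it later.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    max_tool_calls: int = 10        # assumed budget: tool calls per run
    warn_ratio: float = 0.8         # assumed threshold for budget warnings
    spans: list[Span] = field(default_factory=list)
    events: list[dict[str, Any]] = field(default_factory=list)

    def call_tool(self, tool: str, intent: str,
                  fn: Callable[..., Any], **inputs: Any) -> Any:
        redacted = {k: "[REDACTED]" if k in SENSITIVE_KEYS else v
                    for k, v in inputs.items()}
        span = Span(self.trace_id, tool, intent, redacted)
        start = time.perf_counter()
        try:
            result = fn(**inputs)
            span.outcome = "ok"
            return result
        except PolicyBlocked as exc:
            # Persist the refusal reason instead of swallowing it.
            span.outcome = "refused"
            span.refusal_reason = str(exc)
            return None
        except Exception:
            span.outcome = "error"
            raise
        finally:
            span.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(span)
            used = len(self.spans)
            # Emit a budget event once usage nears the limit.
            if used >= self.max_tool_calls * self.warn_ratio:
                self.events.append({
                    "type": "budget_warning",
                    "trace_id": self.trace_id,
                    "used": used,
                    "limit": self.max_tool_calls,
                })
```

Calling `run.call_tool("lookup", "find order", lookup_fn, order_id="A1")` records one span whether the tool succeeds, is refused, or raises. The timeline stays complete even on the failure paths, which is the point of the exercise.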

Human-in-the-loop is a feature, not a bug

Escalation paths should be first-class: which signals trigger review, what context the reviewer sees, and how overrides feed back into evaluation. If your dashboard cannot answer “who approved this and why,” you are not ready for regulated or high-stakes workflows.
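
To make that concrete, here is a small Python sketch of an escalation check and a review log that can answer "who approved this and why." The signal names, thresholds, and the `ReviewDecision` and `ReviewLog` types are assumptions made up for this example; your own triggers will depend on the workflow.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewDecision:
    trace_id: str
    reviewer: str
    approved: bool
    reason: str


class ReviewLog:
    """In-memory stand-in for a durable audit store (hypothetical)."""

    def __init__(self) -> None:
        self._decisions: dict[str, ReviewDecision] = {}

    def record(self, decision: ReviewDecision) -> None:
        # Refuse to store a decision without a reason: the "why" is the audit trail.
        if not decision.reason.strip():
            raise ValueError("a review decision needs a reason")
        self._decisions[decision.trace_id] = decision

    def who_approved(self, trace_id: str) -> Optional[ReviewDecision]:
        return self._decisions.get(trace_id)


def review_triggers(amount: float, confidence: float, *,
                    max_amount: float = 100.0,
                    min_confidence: float = 0.7) -> list[str]:
    """Return the signals that should send this action to a human reviewer."""
    triggers = []
    if amount > max_amount:
        triggers.append("amount_over_limit")
    if confidence < min_confidence:
        triggers.append("low_confidence")
    return triggers
```

Keying decisions by trace id means the reviewer's verdict sits next to the agent's own spans, so an override can be replayed later as an evaluation case.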

Observability is how you earn the right to automate customer-facing decisions.

Novelty Lab runtime principles

Start with three golden paths—happy path, policy block, escalation—and prove you can drill from a customer ticket to the exact agent trace. Everything else is polish on top of that spine.
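
One way to check that drill-down is a tiny index from ticket id to trace id to recorded steps, exercised once per golden path. The Python below is only a sketch: `TraceIndex` and the outcome labels are hypothetical, and in practice the lookup would be a query against your tracing backend.

```python
from __future__ import annotations


class TraceIndex:
    """Maps customer tickets to agent traces (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self._trace_by_ticket: dict[str, str] = {}
        self._steps_by_trace: dict[str, list[tuple[str, str]]] = {}

    def link_ticket(self, ticket_id: str, trace_id: str) -> None:
        self._trace_by_ticket[ticket_id] = trace_id

    def record_step(self, trace_id: str, tool: str, outcome: str) -> None:
        self._steps_by_trace.setdefault(trace_id, []).append((tool, outcome))

    def drill_down(self, ticket_id: str) -> list[tuple[str, str]]:
        """Return the ordered (tool, outcome) steps behind a ticket."""
        trace_id = self._trace_by_ticket.get(ticket_id)
        if trace_id is None:
            return []
        return list(self._steps_by_trace.get(trace_id, []))


# One fixture per golden path: happy path, policy block, escalation.
index = TraceIndex()
golden_paths = {
    "TICKET-1": ("trace-a", [("lookup_order", "ok"), ("send_reply", "ok")]),
    "TICKET-2": ("trace-b", [("lookup_order", "ok"), ("issue_refund", "refused")]),
    "TICKET-3": ("trace-c", [("lookup_order", "ok"), ("issue_refund", "escalated")]),
}
for ticket_id, (trace_id, steps) in golden_paths.items():
    index.link_ticket(ticket_id, trace_id)
    for tool, outcome in steps:
        index.record_step(trace_id, tool, outcome)
```

If `index.drill_down("TICKET-2")` cannot show the refused step, the spine is not in place yet, whatever the dashboards look like.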