About Optyzone

We don't sell agents. We ship systems that survive production.

Most agentic AI projects die between the demo and the second week of production traffic. We exist to ship the rest of the iceberg — the evals, the routing, the human-in-the-loop gates, the telemetry, the runbook, and the on-call. The part that turns "an interesting prompt" into a system your business runs on.

14 days

To first working demo on your stack

6–10 wks

To production-grade ship

≥95%

Eval pass rate before we ship

2-on-call

Engineers per system, 30 days post-launch

What we believe

Four positions we won't soften for a deal.

The model is a part. The system is the product.

Frontier models change every quarter. Your moat is the orchestration, the evals, the fallback paths, and the institutional knowledge baked into the tool layer — not the model number on the API call.

Eval-driven, or it isn't engineering.

Every agent we ship has a golden eval set the customer signs off on, regression runs on every change (model, prompt, tool, data), and a quality dashboard your team can read without us.

Human-in-the-loop is a design discipline, not a button.

We map every external action to an approval class (auto, soft-confirm, hard-approve, or manual-only) based on reversibility, blast radius, and regulatory posture. Override paths are first-class, not a fallback.

Observability before scale.

If you can't replay a 3-month-old trace in under 60 seconds, you can't operate the agent. We ship telemetry on day one, not after the first incident.

How we build

A methodology, not a vibe.

Six disciplines we've refined across every agent we've shipped. Each one survives a model upgrade, a vendor swap, and a year of production traffic.

EvalKit

Eval-driven development

We build a golden set of 200–500 representative inputs per agent — sourced from your real tickets/transcripts/docs, anonymized, and reviewed by your SMEs. Each is scored against a customer-signed rubric (correctness, faithfulness, refusal-when-needed, action-safety, tone). Every PR — prompt, tool, model, retrieval index — runs the full suite. CI blocks merge below 95% pass-rate on the rubric the customer owns.
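A minimal sketch of the merge gate described above, assuming each golden-set item yields a pass/fail result per rubric criterion. The names `EvalResult`, `pass_rate`, and `gate` are hypothetical, and so is the rule that an item passes only when every criterion passes; the real rubric and scoring are whatever the customer signs off on.

```python
from dataclasses import dataclass

# Rubric criteria named in the text; how each one is scored is an assumption.
CRITERIA = ("correctness", "faithfulness", "refusal", "action_safety", "tone")
THRESHOLD = 0.95  # the pass-rate gate from the text


@dataclass(frozen=True)
class EvalResult:
    item_id: str
    scores: dict  # criterion -> bool (True means the criterion passed)

    def passed(self) -> bool:
        # Assumption: an item passes only if every criterion passes.
        return all(self.scores.get(c, False) for c in CRITERIA)


def pass_rate(results: list) -> float:
    if not results:
        return 0.0  # an empty suite should never open the gate
    return sum(r.passed() for r in results) / len(results)


def gate(results: list, threshold: float = THRESHOLD) -> bool:
    """Return True if the change may merge."""
    return pass_rate(results) >= threshold
```

In CI, a wrapper script could exit non-zero when `gate(results)` is false, which is enough for most CI systems to block the merge.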

Model Router

Right model per turn

A typical agent makes 6–20 model calls per task. We route each turn by reasoning depth, latency SLA, and per-call cost ceiling — Opus / Sonnet 4.x / GPT-5 / Haiku / Gemini Flash in the same workflow, with auto-failover. Router decisions are logged with rationale; you can A/B route policies the same way you A/B prompts.
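A minimal sketch of per-turn routing under the three constraints named above. Everything here is illustrative: `ModelOption`, `Turn`, `route`, and `decide` are hypothetical names, the tier and price figures are placeholders rather than benchmarks, and "cheapest eligible model first" is one plausible policy, not necessarily the one a given engagement would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # placeholder label, not a real model ID
    depth: int            # reasoning tier it can handle (1 = shallow, 3 = deep)
    p95_latency_ms: int   # illustrative figure
    cost_per_call: float  # illustrative figure, in USD


@dataclass(frozen=True)
class Turn:
    depth_needed: int
    latency_sla_ms: int
    cost_ceiling: float


def route(turn: Turn, options: list) -> list:
    """Return eligible models cheapest-first; index 0 is primary, the rest are failovers."""
    eligible = [
        m for m in options
        if m.depth >= turn.depth_needed
        and m.p95_latency_ms <= turn.latency_sla_ms
        and m.cost_per_call <= turn.cost_ceiling
    ]
    return sorted(eligible, key=lambda m: m.cost_per_call)


def decide(turn: Turn, options: list) -> dict:
    """Produce a loggable routing decision together with its rationale."""
    ranked = route(turn, options)
    return {
        "primary": ranked[0].name if ranked else None,
        "failover": [m.name for m in ranked[1:]],
        "rationale": (
            f"depth>={turn.depth_needed}, p95<={turn.latency_sla_ms}ms, "
            f"cost<={turn.cost_ceiling}"
        ),
    }
```

Because `decide` returns plain data, two routing policies can be compared by logging both decisions for the same turn and diffing them offline.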

HITL Gates

Approval classes, not approval buttons

Every external action is classified at design time: auto (read-only or fully reversible), soft-confirm (one-tap, can be undone within 24h), hard-approve (human signature + audit trail), or manual-only (agent prepares, human executes). The mapping is reviewed with your compliance team before a single tool gets wired up.
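A minimal sketch of how the four approval classes could be assigned from an action's properties. The rule order in `classify` is an assumption made for illustration (blast radius is left out for brevity); in practice the mapping is decided per action with the compliance team, so treat this as a starting default rather than the actual policy. `Action` and its fields are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalClass(Enum):
    AUTO = "auto"
    SOFT_CONFIRM = "soft-confirm"
    HARD_APPROVE = "hard-approve"
    MANUAL_ONLY = "manual-only"


@dataclass(frozen=True)
class Action:
    name: str
    read_only: bool
    reversibility: str  # "full", "windowed" (undo within a time window), or "none"
    regulated: bool     # flagged during compliance review


def classify(a: Action) -> ApprovalClass:
    # Assumed precedence: the most restrictive applicable rule wins.
    if a.read_only:
        return ApprovalClass.AUTO
    if a.reversibility == "none":
        return ApprovalClass.MANUAL_ONLY if a.regulated else ApprovalClass.HARD_APPROVE
    if a.regulated:
        return ApprovalClass.HARD_APPROVE
    if a.reversibility == "windowed":
        return ApprovalClass.SOFT_CONFIRM
    return ApprovalClass.AUTO
```

Keeping the classifier a pure function of declared properties means the resulting map can be exported as a table and reviewed line by line before any tool is wired up.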

Trace Layer

Every action, replayable

We instrument with OpenTelemetry-compatible tracing (Langfuse / Phoenix / Datadog LLM Observability, your call). Every trace captures the full plan, every tool call with inputs/outputs, the model variant, latency, token cost, and the human override if one happened. A 3-month-old incident replays in 60 seconds. Your team owns the dashboards on day one.
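A minimal sketch of a trace record carrying the fields listed above, with a JSON round trip standing in for storage. This is plain Python, not the OpenTelemetry API or any vendor's schema; `Trace`, `ToolCall`, and `replay` are hypothetical names, and the field layout is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    tool: str
    inputs: dict
    outputs: dict
    latency_ms: int


@dataclass
class Trace:
    trace_id: str
    plan: list                            # ordered plan steps as text
    model_variant: str
    tool_calls: list = field(default_factory=list)
    token_cost_usd: float = 0.0
    human_override: Optional[str] = None  # None when no human intervened

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "Trace":
        data = json.loads(raw)
        data["tool_calls"] = [ToolCall(**c) for c in data["tool_calls"]]
        return Trace(**data)


def replay(trace: Trace) -> list:
    """Walk the stored tool calls in order, without touching any live system."""
    return [f"{c.tool}({c.inputs}) -> {c.outputs}" for c in trace.tool_calls]
```

Replay reads only recorded inputs and outputs, which is what keeps an old incident reproducible even after the underlying systems have changed.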

Failure-Mode Catalog

47 ways agents break — we test for all of them

We maintain an internal catalog of agent failure modes (prompt-injection vectors, tool-loop patterns, hallucinated citations, role-leak, refusal-bypass, retrieval-poisoning, cost-runaway, stale-context, etc.). Every system ships with regression tests for every applicable mode, and a quarterly review against new modes that emerge in the field.

Continuous Improvement

Production traces feed evals

Bad traces from production (low-confidence outputs, escalations, customer corrections) are auto-clustered and surfaced weekly. The most informative get added to the golden eval set — with your SME's sign-off — so the system measurably improves and regressions get caught the moment they appear.
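A minimal sketch of the weekly harvest step. Grouping by an `intent` label is a simple stand-in for real clustering (which would more likely use embeddings), and the trace fields, the `CONFIDENCE_FLOOR` cutoff, and the function names are all hypothetical.

```python
from collections import defaultdict

CONFIDENCE_FLOOR = 0.6  # illustrative cutoff, not a recommended value


def is_bad(trace: dict) -> bool:
    """Flag the three signals named in the text."""
    return (
        trace.get("confidence", 1.0) < CONFIDENCE_FLOOR
        or trace.get("escalated", False)
        or trace.get("customer_corrected", False)
    )


def weekly_candidates(traces: list, top_n: int = 3) -> list:
    """Group bad traces by intent and return the largest groups first."""
    groups = defaultdict(list)
    for t in traces:
        if is_bad(t):
            groups[t.get("intent", "unknown")].append(t["id"])
    # Largest group first; ties broken alphabetically for stable output.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"intent": intent, "trace_ids": ids, "needs_sme_signoff": True}
        for intent, ids in ranked[:top_n]
    ]
```

Nothing returned here enters the golden set automatically; the `needs_sme_signoff` flag marks each group as a proposal awaiting review.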

Reference architecture

The stack we ship under every agent.

Eight layers. We pick the components per engagement (your cloud, your data residency, your model procurement) — but the contract between layers is always the same. That's how we hand off without leaving you stranded.

L7 | Channel | Web / Voice (Twilio, LiveKit) / Email / Slack / Teams / SMS / In-product

L6 | Orchestrator | Plan-act-reflect loop · session memory · tool-use scheduling · cost/latency budget enforcement

L5 | Model Router | Anthropic Opus/Sonnet/Haiku · OpenAI GPT-5/4o · Gemini · OSS via vLLM/TGI · failover + caching

L4 | Tool Layer | Typed tool registry · scoped credentials · retry/idempotency · cost ledger · MCP / function-calling

L3 | Knowledge | Hybrid retrieval (BM25 + dense + reranker) · per-doc citations · freshness SLAs · grounded-only mode

L2 | Safety + HITL Gates | Approval-class router · PII guard · prompt-injection filters · refusal policies · audit log

L1 | Observability | OpenTelemetry traces · token/cost meter · eval dashboard · drift detection · replay UI

L0 | Eval + CI | Golden set · rubric runner · regression on every PR · prod-trace harvester · drift alerts
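To make one layer contract concrete, here is a minimal sketch of the retry/idempotency behavior listed for the tool layer (L4). `ToolRegistry` and its methods are hypothetical, the in-memory result cache stands in for durable storage, and retrying only on `ConnectionError` is an assumption about which failures count as transient.

```python
import time
from typing import Callable


class ToolRegistry:
    def __init__(self, max_attempts: int = 3, backoff_s: float = 0.0):
        self._tools = {}
        self._results = {}  # idempotency key -> stored result
        self.max_attempts = max_attempts
        self.backoff_s = backoff_s

    def register(self, name: str, fn: Callable[..., dict]) -> None:
        self._tools[name] = fn

    def call(self, name: str, idempotency_key: str, **kwargs) -> dict:
        # A repeated key returns the stored result instead of re-running the write.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        last_error = None
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = self._tools[name](**kwargs)
                self._results[idempotency_key] = result
                return result
            except ConnectionError as exc:  # retry only on transient errors
                last_error = exc
                time.sleep(self.backoff_s * attempt)  # linear backoff
        raise RuntimeError(
            f"{name} failed after {self.max_attempts} attempts"
        ) from last_error
```

The idempotency key is what makes retries safe for write actions: if the orchestrator re-issues a call after a timeout, the tool runs at most once per key.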

The receipts

What "production-grade" means, in numbers.

47

Failure modes in our catalog

Each agent we ship has regression tests for the applicable subset.

200–500

Eval items per golden set

Sourced from your real traffic; signed off by your SMEs.

≥95%

Eval pass rate gate

CI blocks merge below threshold. No exceptions, no quiet downgrades.

60s

Trace replay latency

Any production decision, replayable end-to-end including tool I/O.

4

HITL approval classes

Auto · soft-confirm · hard-approve · manual-only. Mapped per action at design time.

30d

Post-launch on-call

Two engineers carrying pages for your system before handoff.

Engagement

Fixed-fee. Fixed-outcome. Six to ten weeks to prod.

We don't do open-ended T&M. Every engagement is a fixed-fee SOW against a fixed outcome — the KPI we commit to in week 0. You know what you're getting, when, and at what price.

Week 0 · Discovery

Pick the use case, define the contract

1-day workshop: KPIs we'll commit to, success thresholds, hard constraints
Data + systems audit: read access we'll need, write actions to gate
Pick the pattern from the Optyzone catalog or co-design a new one
Sign a fixed-fee, fixed-outcome SOW, not an open-ended T&M retainer

Weeks 1–2 · Working Demo

Real agent on your data, in your sandbox

Working demo on real (anonymized if needed) data in your environment
First version of the golden eval set, with your SME sign-off on the rubric
L4 tool layer wired to your sandbox systems with scoped credentials
Demo + numbers reviewed with your stakeholders — go/no-go for pilot

Weeks 3–6 · Pilot

Production traffic, narrow surface

Limited rollout: % of traffic / single team / one geography
Full observability stack live; your team co-owns the dashboards
HITL gates running per the approval-class map
Daily standup with your team; weekly KPI review against committed thresholds
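For the percentage-of-traffic rollout above, a minimal sketch of deterministic bucketing, so the same user or account always lands on the same side of the pilot. The function name, the `salt` value, and hashing with SHA-256 are illustrative choices, not a description of any specific rollout tooling.

```python
import hashlib


def in_pilot(unit_id: str, percent: float, salt: str = "pilot-v1") -> bool:
    """Deterministically place a user or account into the pilot slice.

    percent is on a 0-100 scale; the salt keeps this experiment's
    buckets independent of any other experiment's.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the bucket is fixed per ID, raising `percent` only adds units to the pilot; anyone already in at 10% is still in at 50%, which keeps the ramp to 100% free of flip-flopping.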

Weeks 6–10 · Hardening + GA

Ramp + runbook + on-call

Ramp to 100% under SLA; load test against the failure-mode catalog
Runbook for every alarmable condition, signed off by your on-call lead
Two of our engineers carry pages for 30 days alongside your team
Final handoff: code, evals, dashboards, runbook, decision log

Selected work

What it looks like in production.

Clients anonymized under NDA

Healthcare payer · Fortune-100 health insurer

Prior-authorization agent across 14 specialty lines

Claude Sonnet 4 + Haiku router · Epic / 3 payer portals · Phoenix tracing · Approval-class: hard-approve on denial

The crux

The hard problem wasn't the form — it was reading specialist notes and matching them to per-payer policy. We built a citation-grounded retrieval layer that returned the exact policy clause, then the agent composed the request with quoted evidence. Reviewers could re-verify in 90 seconds.

Series-C fintech · Consumer neobank, 8M accounts

Real-time fraud investigation agent

GPT-5 Reasoning + Sonnet 4.6 fallback · 10+ internal systems via typed tool layer · Langfuse + Datadog · Approval-class: auto on hold, hard-approve on block

The crux

Analyst-grade investigations require evidence assembly across 10+ systems in under 2 seconds. We pre-fetched the high-signal context per transaction (device, velocity, ring-membership) and pushed it into the agent's plan so the LLM only reasoned over a curated brief — not the raw firehose.

Top-5 specialty retailer · $4B GMV omnichannel

Returns & exchange concierge

Sonnet 4.6 across chat + voice (LiveKit) · OMS + WMS + loyalty + payments · Inline experimentation · Approval-class: soft-confirm on refund > $200

The crux

Returns are a conversation, not a form. Our agent asks the right diagnostic question (fit, expectation, quality) and offers the right swap from live inventory — before defaulting to refund. Every conversation runs A/B against a hold-out where customers go straight to refund.

Founder

The person accountable for every agent we ship.

Optyzone is founder-operated by design. The same person who takes your discovery call owns the architecture and reviews every PR. You always know who's accountable.

Bhanu Challa

Founder & Principal Architect

Owns the Optyzone methodology, the reference architecture, and every agent that ships under our name. Two decades building data and AI systems for Fortune 500 healthcare, fintech, and retail teams — with hands-on experience designing and operating agentic systems in production. Your single point of contact from discovery through go-live, and every architectural call in between.

What we won't do

The deals we turn down — and why.

Ship a chatbot when your problem is a workflow agent.

We turn down deals where a deterministic workflow with light LLM augmentation would serve you better — and tell you so on the discovery call.

Run on yesterday's model because it's cheaper for us.

We re-benchmark our routing against the current frontier models monthly. If a new model meaningfully improves your eval results at a lower price, we ship the migration as part of the retainer.

Disappear after handoff.

Two engineers carry pages for 30 days post-launch. A standing quarterly review for the life of the system, included.

Sell you data labeling, training compute, or our own SaaS.

We have no proprietary product, no labeling vendor kickback, no compute markup. Pure build engagement — your IP, your stack, your operating cost.

You've seen the catalog, the methodology, and the numbers.

The next step is a 45-minute discovery call. We come prepared with a sketch of what we'd build for your team — and a candid view on whether you should even use us.

Book the discovery call

Browse the catalog