Optyzone
About Optyzone

We don't sell agents. We ship systems that survive production.

Most agentic AI projects die between the demo and the second week of production traffic. We exist to ship the rest of the iceberg — the evals, the routing, the human-in-the-loop gates, the telemetry, the runbook, and the on-call. The part that turns "an interesting prompt" into a system your business runs on.

14 days · To first working demo on your stack
6–10 wks · To production-grade ship
≥95% · Eval pass rate before we ship
2 on call · Engineers per system, 30 days post-launch

What we believe

Four positions we won't soften for a deal.

01

The model is a part. The system is the product.

Frontier models change every quarter. Your moat is the orchestration, the evals, the fallback paths, and the institutional knowledge baked into the tool layer — not the model number on the API call.

02

Eval-driven, or it isn't engineering.

Every agent we ship has a golden eval set the customer signs off on, regression runs on every change (model, prompt, tool, data), and a quality dashboard your team can read without us.

03

Human-in-the-loop is a design discipline, not a button.

We map every external action to an approval class — auto, soft-confirm, hard-approval, or manual-only — based on reversibility, blast radius, and regulatory posture. Override paths are first-class, not a fallback.

04

Observability before scale.

If you can't replay a 3-month-old trace in under 60 seconds, you can't operate the agent. We ship telemetry on day one, not after the first incident.

How we build

A methodology, not a vibe.

Six disciplines we've refined across every agent we've shipped. Each one survives a model upgrade, a vendor swap, and a year of production traffic.

EvalKit

Eval-driven development

We build a golden set of 200–500 representative inputs per agent — sourced from your real tickets/transcripts/docs, anonymized, and reviewed by your SMEs. Each is scored against a customer-signed rubric (correctness, faithfulness, refusal-when-needed, action-safety, tone). Every PR — prompt, tool, model, retrieval index — runs the full suite. CI blocks merge below 95% pass-rate on the rubric the customer owns.
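As a rough illustration of the merge gate described above, the sketch below computes a pass rate over a golden set and blocks a change that falls under 95%. The names `EvalResult`, `pass_rate`, and `gate` are hypothetical and not EvalKit's actual interface. Treating an item as passing only when every rubric dimension passes is also an assumption.

```python
from dataclasses import dataclass

# Rubric dimensions named in the text; how each one is scored is out of scope.
RUBRIC = ("correctness", "faithfulness", "refusal_when_needed", "action_safety", "tone")
THRESHOLD = 0.95  # merge gate quoted in the text


@dataclass(frozen=True)
class EvalResult:
    """Verdicts for one golden-set item (hypothetical shape)."""
    item_id: str
    verdicts: dict  # rubric dimension -> bool

    def passed(self) -> bool:
        # Assumption: an item passes only if every rubric dimension passes.
        return all(self.verdicts.get(dim, False) for dim in RUBRIC)


def pass_rate(results):
    """Fraction of golden-set items that passed."""
    if not results:
        raise ValueError("golden set is empty")
    return sum(r.passed() for r in results) / len(results)


def gate(results, threshold=THRESHOLD):
    """Return True if the change may merge, False if CI should block it."""
    return pass_rate(results) >= threshold
```

In a CI job, a `False` from `gate` would translate into a non-zero exit code so the merge is blocked.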

Model Router

Right model per turn

A typical agent makes 6–20 model calls per task. We route each turn by reasoning depth, latency SLA, and per-call cost ceiling — Opus / Sonnet 4.x / GPT-5 / Haiku / Gemini Flash in the same workflow, with auto-failover. Router decisions are logged with rationale; you can A/B route policies the same way you A/B prompts.
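One way to picture per-turn routing is the sketch below. It filters a model catalog by the three constraints named above (reasoning depth, latency SLA, per-call cost ceiling), picks the cheapest eligible model, and keeps the rest as a failover order. Everything here is an assumption for illustration: the class names, the 1-to-3 depth scale, the "cheapest eligible" policy, and the placeholder model names and numbers, which are not benchmarks or real pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    depth: int            # 1 = shallow, 3 = deep reasoning (illustrative scale)
    p95_latency_ms: int   # placeholder value, not a benchmark
    cost_per_call: float  # placeholder USD, not real pricing


@dataclass(frozen=True)
class Turn:
    depth_needed: int
    latency_sla_ms: int
    cost_ceiling: float


@dataclass(frozen=True)
class Decision:
    model: Optional[str]  # None when nothing satisfies the constraints
    fallbacks: tuple      # failover order, cheapest first
    rationale: str        # logged alongside the decision


# Hypothetical catalog with generic names.
CATALOG = [
    ModelProfile("small", 1, 400, 0.001),
    ModelProfile("medium", 2, 1200, 0.01),
    ModelProfile("large", 3, 4000, 0.08),
]


def route(turn, catalog=CATALOG):
    """Pick the cheapest model that meets depth, latency, and cost limits."""
    eligible = [
        m for m in catalog
        if m.depth >= turn.depth_needed
        and m.p95_latency_ms <= turn.latency_sla_ms
        and m.cost_per_call <= turn.cost_ceiling
    ]
    if not eligible:
        return Decision(None, (), "no model satisfies depth/latency/cost constraints")
    eligible.sort(key=lambda m: (m.cost_per_call, m.p95_latency_ms))
    primary, rest = eligible[0], eligible[1:]
    rationale = f"cheapest of {len(eligible)} eligible for depth>={turn.depth_needed}"
    return Decision(primary.name, tuple(m.name for m in rest), rationale)
```

Because a routing policy is just a function of the turn and the catalog, two policies can be compared on the same traffic in the same way two prompts would be.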

HITL Gates

Approval classes, not approval buttons

Every external action is classified at design time: auto (read-only / fully reversible), soft-confirm (one-tap, can be undone in 24h), hard-approve (human signature + audit trail), or manual-only (agent prepares, human executes). Mapping is reviewed with your compliance team before a single tool gets wired up.
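The four classes above can be expressed as a small design-time lookup. The sketch below is a minimal illustration: the `Action` fields and the precedence of the rules in `classify` are assumptions, and a real mapping would come out of the compliance review rather than a fixed function.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalClass(Enum):
    AUTO = "auto"
    SOFT_CONFIRM = "soft-confirm"
    HARD_APPROVE = "hard-approve"
    MANUAL_ONLY = "manual-only"


@dataclass(frozen=True)
class Action:
    """Design-time description of one external action (hypothetical fields)."""
    name: str
    read_only: bool = False
    fully_reversible: bool = False
    undoable_within_24h: bool = False
    regulated: bool = False
    agent_may_execute: bool = True


def classify(action: Action) -> ApprovalClass:
    # Assumed precedence: the most restrictive rule wins.
    if not action.agent_may_execute:
        return ApprovalClass.MANUAL_ONLY      # agent prepares, human executes
    if action.regulated and not action.read_only:
        return ApprovalClass.HARD_APPROVE     # human signature + audit trail
    if action.read_only or action.fully_reversible:
        return ApprovalClass.AUTO
    if action.undoable_within_24h:
        return ApprovalClass.SOFT_CONFIRM     # one-tap, can be undone in 24h
    return ApprovalClass.HARD_APPROVE         # default for irreversible writes
```

Defaulting unknown writes to hard-approve keeps a newly added tool from running unattended before anyone has reviewed it.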

Trace Layer

Every action, replayable

We instrument with OpenTelemetry-compatible tracing (Langfuse / Phoenix / Datadog LLM Observability, your call). Every trace captures the full plan, every tool call with inputs/outputs, the model variant, latency, token cost, and the human override if one happened. A 3-month-old incident replays in under 60 seconds. Your team owns the dashboards on day one.
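To make the captured fields concrete, here is a minimal sketch of a trace record that round-trips through JSON, which is the property a replay depends on. It uses plain dataclasses rather than any OpenTelemetry, Langfuse, Phoenix, or Datadog API, and the class and field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    tool: str
    inputs: dict
    outputs: dict


@dataclass
class TraceRecord:
    """One agent task, with the fields listed in the text (hypothetical shape)."""
    trace_id: str
    plan: list                 # ordered plan steps as strings
    model_variant: str
    latency_ms: int
    token_cost_usd: float
    tool_calls: list = field(default_factory=list)
    human_override: Optional[str] = None  # None if no override happened

    def to_json(self) -> str:
        # asdict() recurses into the nested ToolCall dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "TraceRecord":
        data = json.loads(payload)
        data["tool_calls"] = [ToolCall(**tc) for tc in data["tool_calls"]]
        return cls(**data)
```

A stored record that deserializes to an equal object is the minimum needed before a replay UI can step through the plan and tool I/O.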

Failure-Mode Catalog

47 ways agents break, and we test for every one that applies

We maintain an internal catalog of agent failure modes (prompt-injection vectors, tool-loop patterns, hallucinated citations, role-leak, refusal-bypass, retrieval-poisoning, cost-runaway, stale-context, etc.). Every system ships with regression tests for every applicable mode, and a quarterly review against new modes that emerge in the field.
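As an example of what a regression check for one catalog entry might look like, the sketch below flags a tool-loop pattern: the same tool called with the same arguments more than a set number of times in a row. The function names and the default limit of three repeats are assumptions, not the catalog's actual definition of a tool loop.

```python
import json


def call_key(tool, args):
    """Stable identity for a tool call: tool name plus canonical JSON args."""
    return tool + ":" + json.dumps(args, sort_keys=True)


def has_tool_loop(calls, max_repeats=3):
    """Return True if any identical call repeats more than max_repeats times in a row.

    `calls` is an ordered list of (tool_name, args_dict) pairs.
    """
    streak = 0
    previous = None
    for tool, args in calls:
        key = call_key(tool, args)
        streak = streak + 1 if key == previous else 1
        if streak > max_repeats:
            return True
        previous = key
    return False
```

A regression test would replay a recorded task and assert that `has_tool_loop` stays `False` for its tool-call sequence.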

Continuous Improvement

Production traces feed evals

Bad traces from production (low-confidence outputs, escalations, customer corrections) are auto-clustered and surfaced weekly. The most informative get added to the golden eval set — with your SME's sign-off — so the system measurably improves and regressions get caught the moment they appear.
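The weekly harvest described above might be sketched as follows. Grouping by a `(signal, intent)` key is a deliberately simple stand-in for real clustering, and the dictionary fields, function names, and `top_k` default are hypothetical. The one rule taken directly from the text is that nothing enters the golden set without SME sign-off.

```python
from collections import defaultdict

# Signals the text names as marking a "bad" production trace.
SIGNALS = {"low_confidence", "escalation", "customer_correction"}


def harvest(traces, top_k=3):
    """Group bad traces and return the largest groups as golden-set candidates."""
    clusters = defaultdict(list)
    for trace in traces:
        if trace["signal"] in SIGNALS:
            clusters[(trace["signal"], trace["intent"])].append(trace)
    # Largest clusters first; ties broken by key so the order is deterministic.
    ranked = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"cluster": key, "count": len(members), "example_id": members[0]["id"]}
        for key, members in ranked[:top_k]
    ]


def promote(golden_set, candidate, sme_approved):
    """Add a candidate to the golden set only with SME sign-off."""
    if not sme_approved:
        return list(golden_set)
    return list(golden_set) + [candidate["example_id"]]
```

Ranking by cluster size is only one possible reading of "most informative"; a production version could weight by severity or novelty instead.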

Reference architecture

The stack we ship under every agent.

Eight layers. We pick the components per engagement (your cloud, your data residency, your model procurement) — but the contract between layers is always the same. That's how we hand off without leaving you stranded.

L7  Channel: Web / Voice (Twilio, LiveKit) / Email / Slack / Teams / SMS / In-product
L6  Orchestrator: Plan-act-reflect loop · session memory · tool-use scheduling · cost/latency budget enforcement
L5  Model Router: Anthropic Opus/Sonnet/Haiku · OpenAI GPT-5/4o · Gemini · OSS via vLLM/TGI · failover + caching
L4  Tool Layer: Typed tool registry · scoped credentials · retry/idempotency · cost ledger · MCP / function-calling
L3  Knowledge: Hybrid retrieval (BM25 + dense + reranker) · per-doc citations · freshness SLAs · grounded-only mode
L2  Safety + HITL Gates: Approval-class router · PII guard · prompt-injection filters · refusal policies · audit log
L1  Observability: OpenTelemetry traces · token/cost meter · eval dashboard · drift detection · replay UI
L0  Eval + CI: Golden set · rubric runner · regression on every PR · prod-trace harvester · drift alerts

The receipts

What "production-grade" means, in numbers.

47 · Failure modes in our catalog
Each agent we ship has regression tests for the applicable subset.

200–500 · Eval items per golden set
Sourced from your real traffic; signed off by your SMEs.

≥95% · Eval pass rate gate
CI blocks merge below threshold. No exceptions, no quiet downgrades.

60s · Trace replay latency
Any production decision, replayable end-to-end including tool I/O.

4 · HITL approval classes
Auto · soft-confirm · hard-approve · manual-only. Mapped per action at design time.

30d · Post-launch on-call
Two engineers carrying pages for your system before handoff.

Engagement

Fixed-fee. Fixed-outcome. Ten weeks to prod.

We don't do open-ended T&M. Every engagement is a fixed-fee SOW against a fixed outcome — the KPI we commit to in week 0. You know what you're getting, when, and at what price.

Week 0 · Discovery

Pick the use case, define the contract

  • 1-day workshop: KPIs we'll commit to, success thresholds, hard constraints
  • Data + systems audit: read access we'll need, write actions to gate
  • Pick the pattern from the Optyzone catalog or co-design a new one
  • Sign a fixed-fee, fixed-outcome SOW, not an open-ended T&M retainer
Weeks 1–2 · Working Demo

Real agent on your data, in your sandbox

  • Working demo on real (anonymized if needed) data in your environment
  • First version of the golden eval set, with your SME sign-off on the rubric
  • L4 tool layer wired to your sandbox systems with scoped credentials
  • Demo + numbers reviewed with your stakeholders — go/no-go for pilot
Weeks 3–6 · Pilot

Production traffic, narrow surface

  • Limited rollout: % of traffic / single team / one geography
  • Full observability stack live; your team co-owns the dashboards
  • HITL gates running per the approval-class map
  • Daily standup with your team; weekly KPI review against committed thresholds
Weeks 6–10 · Hardening + GA

Ramp + runbook + on-call

  • Ramp to 100% under SLA; load test against the failure-mode catalog
  • Runbook for every alarmable condition, signed off by your on-call lead
  • Two of our engineers carry pages for 30 days alongside your team
  • Final handoff: code, evals, dashboards, runbook, decision log
Selected work

What it looks like in production.

Clients anonymized under NDA
Healthcare payer · Fortune-100 health insurer

Prior-authorization agent across 14 specialty lines

Claude Sonnet 4 + Haiku router · Epic / 3 payer portals · Phoenix tracing · Approval-class: hard-approve on denial
The crux

The hard problem wasn't the form — it was reading specialist notes and matching them to per-payer policy. We built a citation-grounded retrieval layer that returned the exact policy clause, then the agent composed the request with quoted evidence. Reviewers could re-verify in 90 seconds.

Series-C fintech · Consumer neobank, 8M accounts

Real-time fraud investigation agent

GPT-5 Reasoning + Sonnet 4.6 fallback · 10+ internal systems via typed tool layer · Langfuse + Datadog · Approval-class: auto on hold, hard-approve on block
The crux

Analyst-grade investigations require evidence assembly across 10+ systems in under 2 seconds. We pre-fetched the high-signal context per transaction (device, velocity, ring-membership) and pushed it into the agent's plan so the LLM only reasoned over a curated brief — not the raw firehose.

Top-5 specialty retailer · $4B GMV omnichannel

Returns & exchange concierge

Sonnet 4.6 across chat + voice (LiveKit) · OMS + WMS + loyalty + payments · Inline experimentation · Approval-class: soft-confirm on refund > $200
The crux

Returns are a conversation, not a form. Our agent asks the right diagnostic question (fit, expectation, quality) and offers the right swap from live inventory — before defaulting to refund. Every conversation runs A/B against a hold-out where customers go straight to refund.

Founders

The two people accountable for every agent we ship.

Optyzone is founder-operated by design. Shashi takes your discovery call and stays your point of contact. Bhanu owns the architecture and reviews every PR. You always know who's accountable.

Bhanu Challa
Co-founder & CTO

Owns the Optyzone methodology, the reference architecture, and every agent that ships under our name. Two decades building data and AI systems for Fortune 500 healthcare, fintech, and retail teams — and shipping agent platforms before the category had a name. Reviews every PR; owns every architectural call.

Shashi Kumar
Co-founder & CEO

Runs every discovery call, scopes every engagement, and stays close to the customer through go-live. Background in enterprise sales and partnerships across regulated industries — the person who makes sure the agent we build is the agent your business actually needs. Your single point of contact from first call to handoff.

What we won't do

The deals we turn down — and why.

Ship a chatbot when your problem is a workflow agent.

We turn down deals where a deterministic workflow with light LLM augmentation would serve you better — and tell you so on the discovery call.

Run on yesterday's model because it's cheaper for us.

We benchmark our routing against the current frontier models monthly. If a new model meaningfully improves your eval at a lower price, we ship the migration as part of the retainer.

Disappear after handoff.

Two engineers carry pages for 30 days post-launch. After that, a standing quarterly review is included for the life of the system.

Sell you data labeling, training compute, or our own SaaS.

We have no proprietary product, no labeling vendor kickback, no compute markup. Pure build engagement — your IP, your stack, your operating cost.

You've seen the catalog, the methodology, and the numbers.

The next step is a 45-minute discovery call. We come prepared with a sketch of what we'd build for your team — and a candid view on whether you should even use us.