Decagon, the leader in conversational AI agents for concierge customer experiences, announced Duet Autopilot, the first agent to deliver automatic and verifiable self-improvement for CX agents.
To measure Autopilot’s efficacy, Decagon also built DuetBench, the industry's first benchmark for evaluating agent self-improvement end-to-end. Against it, Duet Autopilot passed 93% of diagnostic tasks, exceeding the average human score.
"Autopilot is a shift from building agents by hand to managing agents that improve themselves," said Alan Yiu, VP of Product at Decagon. "Teams set the direction and review the work; Autopilot handles the diagnosing, testing, and editing that used to consume their week. Every fix compounds, which ultimately empowers businesses to provide their customers with a 24/7 AI concierge that gets measurably better with every interaction."
Closing the loop on agent improvement
Until now, improving an AI agent has been bottlenecked by manual work. As customer signals accumulate, teams must interpret feedback, decide on changes, test them, and ship improvements by hand. Too many cycles go into identifying and prioritizing high-impact updates, and even then, manual effort caps how much gets done. Duet Autopilot removes that constraint by acting on the full breadth of production signals.
Duet Autopilot delivers three core capabilities that work together as a continuous loop:
Because Autopilot is itself a Decagon agent, it is subject to its own improvement loop. Every reviewer correction and successful outcome feeds back into how it operates, so each cycle produces higher-quality updates than the last. This way, agent performance improves not at a fixed rate, but exponentially.
Proven in the field, formalized in the benchmark
Duet Autopilot is being validated with a cohort of enterprise customers and design partners across financial services, retail, and consumer technology, who are measuring its impact on resolution rates, escalation rates, and coverage.
“At our scale, manually reviewing conversations for errors isn't an option,” said Matt McCollum, senior manager of customer experience at Opendoor. “Decagon Autopilot frees our team to focus on decisions rather than digging through logs. It surfaces what changed, what was considered, and why. That transparency is what makes AI actually trustworthy in production.”
DuetBench also fills a gap in how conversational AI agents are evaluated. Existing benchmarks measure whether an agent can resolve a fixed set of issues, but they don’t yet measure the improvement loop. By contrast, DuetBench measures whether Autopilot can make verifiable agent improvements rather than merely producing plausible-looking changes.
Copyright 2026, AI Reporter America. All rights reserved.