
Why I Built STUDIO: Four Generations of AI Code Supervision

Four AI orchestration systems taught me that 19 agents produce more drift than 3 supervised ones. STUDIO adds confidence scoring and preference learning.

7 min read · By Dakota Smith

19 agents produced worse code than 3.

That counterintuitive result—discovered after building four generations of AI orchestration systems—drives the thesis behind STUDIO. The problem with AI coding tools isn't speed. It's drift. Ask Claude to build a feature—you get working code. Ask for another—more working code. After ten features, your codebase has three different patterns for the same problem, duplicate utilities scattered across files, and state management that contradicts itself. Each change looked correct in isolation. The aggregate is a mess.

I spent four iterations trying to fix this. Here's the path from reactive review to proactive supervision.

The Drift Problem

AI optimizes for task completion, not system coherence. It solves the immediate problem without considering how that solution fits existing patterns. Junior developers do this too—but juniors learn from code review. AI accepts the review, then makes the same mistake in a different file.

After 14 years in enterprise .NET and Sitecore—the same platform-specific complexity that drove the CMS analysis marketplace—I've seen what drift looks like at scale. Codebases that started clean but accumulated "quick fixes" until the original architecture disappeared. AI accelerates this. It writes code faster than you can review it, and each generation of code is internally consistent but externally disconnected from what came before.

The core insight: prevention beats detection. Catching drift after it happens is cleanup work. Stopping it before execution is architecture.

Generation 1: APL — Reactive Review

APL—the Autonomous Phased Looper—was my first attempt. Three phases: Plan, Execute, Review. The reviewer used Reflexion to self-critique and catch mistakes after execution. The learner agent persisted patterns to disk for future sessions.

APL worked. It built this blog with 47 stories across 6 epics, hitting Lighthouse 99. But the review phase was reactive—it caught problems after they existed in the codebase. The self-learning system helped patterns compound over time, and the ReAct execution loops caught individual task failures. Cross-task drift, though, slipped through. The reviewer could find inconsistent import paths. It couldn't prevent the architectural decisions that led to them.

APL proved that phased execution with self-learning produces working software. It didn't prove that working software stays coherent at scale.

Generation 2: ORC — 19 Agents, 19 Opinions

If three agents couldn't prevent drift, nineteen might.

ORC decomposed development into specialists: Architect, Security Engineer, Database Engineer, Frontend Specialist, QA, Technical Writer, and more. It added codebase analysis, Epic→Feature→Story planning, anti-slop detection, and pattern learning.

The system was powerful. One prompt could generate complete features with tests, docs, and security review. But 19 agents meant 19 opinions. The coordination overhead ate tokens—each specialist wanted to weigh in on every decision. Debugging required figuring out which agent introduced a problem. Specialists sometimes disagreed (the Architect wanted one pattern, the Security Engineer wanted another), and resolving conflicts took longer than building the feature manually.

More agents didn't mean better code. It meant more complexity and more coordination surface area. ORC taught me that capability without accountability produces chaos.

Generation 3: ALLOY — The Hybrid That Couldn't Decide

ALLOY tried to split the difference—three core agents that could summon specialists when needed. A hybrid approach: less overhead than ORC, more capability than APL.

The problem moved: deciding when to summon specialists became its own source of drift. The system would skip consultation to move faster, then produce work that needed specialist review anyway. When it did consult, the specialist's recommendation sometimes conflicted with what the core agent had already started building.

ALLOY proved something important: the problem wasn't capability. It was accountability. No amount of agent architecture fixes drift if the system can bypass its own quality gates. The moment an agent can skip a check to "move faster," it will—and the drift compounds from there.

Generation 4: STUDIO — Supervision Over Scale

STUDIO returns to three agents: Planner, Builder, Content Writer. The difference is supervision.

Before any plan executes, STUDIO runs a mandatory questioning phase. It consults domain expert personas, challenges its own plan against five criteria (requirements coverage, edge cases, simplicity, integration fit, failure modes), and presents a confidence score. Low confidence triggers more questions. High confidence proceeds to execution.

╔══════════════════════════════════════════════════════════════╗
║  PLAN CONFIDENCE: 85%                                        ║
╠══════════════════════════════════════════════════════════════╣
║  Requirements:    [████████░░] 80%                           ║
║  Step Quality:    [██████████] 100%                          ║
║  Context:         [████████░░] 80%                           ║
║  Risk:            [████████░░] 80%                           ║
╚══════════════════════════════════════════════════════════════╝
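A minimal sketch of that gate, assuming a plain average over the four criteria (the criterion names come from the report card above; the function names and aggregation are illustrative, not STUDIO's internals):

```python
# Hypothetical sketch of the confidence gate. Sub-scores mirror the
# report card above; the plain-average aggregation and the 85%
# threshold are illustrative assumptions.

def plan_confidence(scores: dict[str, float]) -> float:
    """Fold per-criterion scores into one confidence value."""
    return sum(scores.values()) / len(scores)

def needs_more_questions(scores: dict[str, float], threshold: float = 0.85) -> bool:
    """Low confidence triggers another round of questioning."""
    return plan_confidence(scores) < threshold

scores = {"requirements": 0.80, "step_quality": 1.00, "context": 0.80, "risk": 0.80}
print(f"{plan_confidence(scores):.0%}")  # 85%
print(needs_more_questions(scores))      # False: proceed to execution
```

The useful property isn't the arithmetic—it's that the number is shown to you before anything executes, so doubt is visible instead of buried.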

The Builder executes exactly what the plan specifies. Each step has a validation command. If validation fails, it retries with hints from the failure output. If retries exhaust, it blocks and asks for help. Work doesn't silently fail—every step either passes validation or stops the pipeline.
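In rough Python, that loop looks something like this—names like `run_step` and `StepBlocked` are illustrative, and in the real system the Builder is an agent, not a callable:

```python
import subprocess
from typing import Callable

class StepBlocked(Exception):
    """Raised when retries are exhausted: the pipeline stops and asks for help."""

def run_step(build: Callable[[str], None], validation_cmd: str,
             max_retries: int = 3) -> None:
    """Execute one plan step, then validate; retry with the failure output as a hint."""
    hint = ""
    for _ in range(max_retries):
        build(hint)  # the Builder does the work, guided by any prior failure hint
        result = subprocess.run(validation_cmd, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return  # validation passed; move on to the next step
        hint = result.stderr or result.stdout  # feed the failure back into the retry
    raise StepBlocked(f"validation still failing after {max_retries} attempts: {hint}")
```

The key design choice is the exception at the end: there is no code path where a step fails validation and the pipeline keeps going.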

When you correct something, STUDIO asks if it should remember the preference. Say yes, and it persists to a rules file that applies to future builds. Corrections compound instead of disappearing between sessions. This is the same persistence principle from APL's .apl/ directory, but applied to supervision rules rather than coding patterns.

The preference file grows over time:

{
  "preferences": [
    {
      "rule": "Use named exports, never default exports",
      "learned_from": "session_2026-01-28",
      "applied_count": 47
    },
    {
      "rule": "All API routes return typed response objects",
      "learned_from": "session_2026-01-30",
      "applied_count": 12
    }
  ]
}
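Remembering a correction can be sketched as a small append-or-increment over that file; the helper name and file path here are illustrative, based only on the JSON shape shown:

```python
import json
from pathlib import Path

def remember_preference(rule: str, session: str,
                        path: Path = Path("preferences.json")) -> None:
    """Persist a confirmed correction so future builds apply it automatically."""
    data = {"preferences": []}
    if path.exists():
        data = json.loads(path.read_text())
    for pref in data["preferences"]:
        if pref["rule"] == rule:          # already known: count another application
            pref["applied_count"] += 1
            break
    else:
        data["preferences"].append(       # new rule: record where it came from
            {"rule": rule, "learned_from": session, "applied_count": 1}
        )
    path.write_text(json.dumps(data, indent=2))
```

Because the file is plain JSON, you can also edit it by hand—delete a rule you no longer want, and it stops shaping future plans.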

After 10 corrections, STUDIO's plans already reflect your preferences. After 50, it rarely needs correction at all. The supervision tightens automatically.

The Pattern: What Four Generations Taught Me

Supervision beats autonomy. AI writes code fine. AI can't judge if that code fits your architecture without explicit checks. Validation at each step prevents drift better than review after the fact. APL detected problems reactively. STUDIO prevents them proactively.

Constraints beat capabilities. ORC could do more than STUDIO. STUDIO's constraints—mandatory questioning, quality gates, five challenges before execution—produce more consistent results. Limiting what AI can skip forces thoroughness.

Memory compounds value. Ephemeral sessions waste corrections. Persisting preferences means projects get smarter over time. This applies to both APL's pattern learning and STUDIO's preference rules.

Simplicity enables debugging. When ORC failed, finding which of 19 agents broke took longer than fixing the problem. Three agents means three places to look. My dev setup evolved toward this same principle—subtraction matters as much as addition.

The agent count doesn't correlate with output quality. Three supervised agents outperform 19 unsupervised ones. The supervision architecture—confidence scoring, mandatory challenges, persistent preferences—matters more than the number of specialists. This insight applies beyond coding tools to any multi-agent system architecture.

The Tradeoffs

STUDIO's supervision model has costs:

Supervision adds 30-60 seconds of latency before execution. The mandatory questioning phase, confidence scoring, and five-criteria challenge run before any code is written. For a 5-minute feature, that's 10-20% overhead. For quick fixes or single-line changes, the supervision pipeline costs more than the work itself.

Rapid prototyping doesn't fit the supervised model. When the goal is "try three approaches and see which feels right," mandatory validation gates slow exploration. Use vanilla Claude Code or APL for exploratory work. STUDIO is for execution against known requirements.

The confidence threshold (default 85%) needs per-team tuning. At 85%, STUDIO asks additional questions on roughly 40% of plans. Some teams find this too cautious; others want it higher. Tuning the threshold requires 5-10 sessions of observation to find the right balance for your codebase and risk tolerance.

The preference file grows and needs periodic pruning. After 50+ sessions, accumulated preferences can conflict—an early rule about naming conventions might contradict a later architectural decision. Quarterly review of the preferences file prevents stale rules from degrading output. The same maintenance principle applies to APL's pattern directory.
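A pruning pass can be sketched as keeping only rules that are either recent or demonstrably in use—the cutoff date and minimum applied count below are an illustrative policy, not STUDIO defaults:

```python
def prune_preferences(prefs: list[dict], cutoff: str,
                      min_applied: int = 5) -> list[dict]:
    """Drop stale rules: keep a rule if it was learned after `cutoff`
    (an ISO date like '2026-01-01') or has been applied often enough.
    String comparison works because 'session_YYYY-MM-DD' sorts by date."""
    kept = []
    for pref in prefs:
        recent = pref["learned_from"] >= f"session_{cutoff}"
        in_use = pref["applied_count"] >= min_applied
        if recent or in_use:
            kept.append(pref)
    return kept
```

Detecting rules that outright contradict each other still takes a human read-through; this only clears out the dead weight so that review stays short.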

Key Takeaways

  • 19 agents produce more drift than 3 supervised ones. Coordination overhead and conflicting opinions outweigh the benefits of specialization.
  • Prevention beats detection. Catching drift reactively (APL) works. Preventing it proactively (STUDIO) works better.
  • Confidence scoring makes uncertainty visible. You see doubt before it becomes a problem in your codebase.
  • Persistent preferences create compounding quality. Each correction makes future builds more accurate. After 50 corrections, the system rarely needs intervention.
  • Constraints produce better output than capabilities. Mandatory quality gates force thoroughness. Optional checks get skipped.

Install STUDIO:

claude
/plugin marketplace add https://github.com/twofoldtech-dakota/studio.git
/plugin install studio@twofoldtech-dakota
/build "your goal here"

Watch the questioning phase. See the confidence score before execution begins. That visibility—knowing the system's uncertainty before it writes code—is the difference between supervised and autonomous development.

GitHub: twofoldtech-dakota/studio
