Rakuten Needed Seven Hours for 12.5 Million Lines. How Agent-Based Development Workflows Work.

Dr. Florian Drechsler | March 4, 2026 | 10 min read

AI Agents, Software Engineering, Architecture, AI Development

In early 2026, Rakuten completed a complex vLLM implementation across a codebase of 12.5 million lines. Autonomously, in seven hours, with 99.9% numerical accuracy. No single developer wrote the code. An orchestrated team of AI agents took over the task, while humans specified the requirements and reviewed the results.

This is not a lab experiment. TELUS has also generated over 13,000 AI solutions and shipped engineering code 30% faster, saving over 500,000 hours in total. This is the direction professional software development is heading: away from writing individual lines of code, toward orchestrating specialized agents that implement features, write tests, and deliver pull requests.

From Conductor to Orchestrator

O'Reilly Radar describes the current shift as the next stage in the abstraction history of software development: Assembly, high-level languages, frameworks, AI code completion, and now agentic orchestration. Each stage moves the developer further from low-level details and closer to architectural intent.

In the previous "conductor" model, a developer guides a single AI assistant step by step through a task. In the "orchestrator" model, the developer describes a task, multiple specialized agents work asynchronously, and the result is a finished pull request for human review.

The consequence for day-to-day work: developer effort shifts. According to O'Reilly, it moves "forward into precise task specification and backward into code review." The core question changes from "How do I code this?" to "How do I break this task down so that agents can execute it autonomously?"

GitHub Copilot's Coding Agent demonstrates this pattern concretely: it "evaluates assigned tasks, makes the necessary changes, and opens pull requests." Developers continue working while the agent runs in the background, steering the outcome through PR reviews. Recommended use cases: routine bug fixes, background tasks, test generation, dependency updates, and prototyping.

Bitmovin documents the fully automated pipeline from Jira ticket to pull request, but warns clearly: "Multi-agent coding feels like magic at first, but without proper orchestration it quickly becomes a chaotic mess of broken builds and merge conflicts."

The DEV Community sums up the philosophical difference between the tool categories concisely: Copilot and Cursor are assistive; they amplify what you are currently doing. Claude Code is agentic: you describe a goal and it executes a plan. It doesn't suggest, it acts: it opens files, writes code, updates configurations, and runs builds.

McKinsey underscores that this shift is more than a tool change: "Agentic AI initiatives that fundamentally rethink entire workflows deliver better outcomes than those that simply bolt AI onto existing processes."

When developers become orchestrators, an immediate practical question arises: what does the orchestra actually look like?

Multi-Agent Pipelines: Specialization Over Universal Agents

The answer lies in specialized multi-agent pipelines. The analogy is closer than expected: a modern assembly line doesn't work with a universal robot, but with specialized stations (welding, painting, quality control). Each station does one thing well. In the same way, production-grade agent systems break the development process into phases, each staffed with a focused agent that has clear responsibilities.

OpenObserve has the most detailed publicly documented pipeline: the "Council of Sub Agents." Six specialized AI agents work in a 6-phase pipeline:

Phase	Agent	Task
1	Analyst	Extracts features and edge cases from requirements
2	Architect	Creates prioritized test plans (P0/P1/P2)
3	Engineer	Generates Playwright test code with page object models
4	Sentinel	Quality gate: blocks on critical issues
5	Healer	Runs tests, diagnoses failures, iterates up to 5x
6	Scribe	Documents everything

The core architectural principle is context chaining: each agent receives the enriched output of its predecessors. The Engineer doesn't start with raw requirements but with the Architect's structured test plan. The Healer doesn't operate blindly but with the Analyst's edge case inventory.
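Context chaining can be sketched in a few lines of Python. This is a minimal illustration, not OpenObserve's implementation: the `Pipeline` class, the stage functions, and the sample data are all hypothetical stand-ins, and a real stage would call a model instead of returning fixed values.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage signature: reads the shared context, returns its own output.
Stage = Callable[[dict], object]


@dataclass
class Pipeline:
    stages: list[tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, requirements: str) -> dict:
        # The context grows with every stage, so later agents see all earlier outputs.
        context: dict = {"requirements": requirements}
        for name, stage in self.stages:
            context[name] = stage(context)
        return context


# Stand-in agents (assumptions); a real system would call a model here.
def analyst(ctx: dict) -> dict:
    return {"features": ["login"], "edge_cases": ["empty password"]}


def architect(ctx: dict) -> list[dict]:
    # Builds on the analyst's edge cases instead of the raw requirements.
    return [{"case": c, "priority": "P0"} for c in ctx["analyst"]["edge_cases"]]


def engineer(ctx: dict) -> list[str]:
    # Starts from the architect's structured plan.
    return [f"test_{p['case'].replace(' ', '_')}" for p in ctx["architect"]]


pipeline = Pipeline().add("analyst", analyst).add("architect", architect).add("engineer", engineer)
result = pipeline.run("Users can log in with email and password.")
print(result["engineer"])
```

Because the context is cumulative, a later stage such as the Healer can still read the Analyst's edge cases even though several stages sit in between.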

The central insight across all sources: specialization beats generalization. Bounded agents with clear responsibilities outperform "super agents" that try to do everything at once.

The results speak for themselves. According to OpenObserve, the test suite grew from 380 to over 700 tests (up 84%), while flaky tests dropped by 85%, from over 30 down to 4 or 5. A production bug in the ServiceNow integration was caught before customers reported it. Feature analysis dropped from 45-60 minutes to 5-10 minutes, and human review time went from hours to minutes.
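The percentages can be checked against the raw counts. The short calculation below uses the rounded figures from the text; since the source says "over 700" and "over 30", the inputs are approximations, and taking 4.5 as the midpoint of "4 or 5" is an assumption made here.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


tests_growth = percent_change(380, 700)  # test count, rounded figures from the text
flaky_change = percent_change(30, 4.5)   # 4.5 = midpoint of "4 or 5" (assumption)

print(f"tests: {tests_growth:+.0f}%")
print(f"flaky: {flaky_change:+.0f}%")
```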

Academic research confirms the pattern. A literature survey on LLM-based multi-agent systems documents how development processes are organized into distinct phases, each managed by specialized agents with domain knowledge.

# Pseudocode: Multi-Agent Pipeline Configuration
pipeline:
  stages:
    - agent: analyst
      input: requirements.md
      output: feature-analysis.json
    - agent: architect
      input: feature-analysis.json
      output: test-plan.yaml
    - agent: engineer
      input: test-plan.yaml
      output: test-code/
    - agent: sentinel
      input: test-code/
      gate: block_on_critical
      output: sentinel-report.json   # consumed by the healer stage
    - agent: healer
      input:                         # a list, so each input is a separate entry
        - test-code/
        - sentinel-report.json
      max_iterations: 5
      output: verified-tests/
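The Healer's bounded retry loop (`max_iterations: 5`) might look roughly like the following sketch. The `heal` function and its two callbacks are hypothetical names introduced for illustration; the stand-in `run_tests` and `repair` only simulate a failing selector being fixed.

```python
from typing import Callable


def heal(
    run_tests: Callable[[str], list[str]],
    repair: Callable[[str, list[str]], str],
    test_code: str,
    max_iterations: int = 5,
) -> tuple[str, bool, int]:
    """Run tests and repair until they pass or the iteration budget is spent.

    Returns (final_code, passed, iterations_used).
    """
    for iteration in range(1, max_iterations + 1):
        failures = run_tests(test_code)
        if not failures:
            return test_code, True, iteration
        if iteration < max_iterations:
            test_code = repair(test_code, failures)
    # Budget exhausted: stop and hand over to a human instead of looping forever.
    return test_code, False, max_iterations


# Stand-ins (assumptions): a test "fails" while a placeholder selector is present.
def fake_run_tests(code: str) -> list[str]:
    return ["selector not found"] if "TODO" in code else []


def fake_repair(code: str, failures: list[str]) -> str:
    return code.replace("TODO", "#submit")


final_code, passed, used = heal(fake_run_tests, fake_repair, 'page.click("TODO")')
print(passed, used)
```

The hard cap matters more than the repair logic: without it, an agent that cannot fix a test would keep consuming time and tokens indefinitely.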

Anthropic's Agent Teams offer this pattern as a platform feature: one session acts as team lead and distributes tasks, while team members work in their own context windows. The recommendation: 3 to 5 team members for most workflows.

Workspace Isolation: Each Agent in Its Own Sandbox

When multiple agents work on the same code in parallel, each one needs an isolated environment. Tessl describes three escalating strategies: multiple IDE windows (minimal isolation), Git worktrees (separate working directories with shared Git history), and container isolation (each agent gets its own isolated runtime environment). In practice, Git worktrees have become the standard.

Agent Interviews documents the approach: "Each worktree is completely isolated, files created in one don't appear in any other." This makes it possible to have multiple agents working simultaneously on isolated copies of the codebase, each on its own branch. Background agents like Copilot CLI specifically use worktrees to avoid conflicts with active work.

The security aspect is central. Tessl names the fundamental problem: "Delegation to agents requires expanded access, but that's exactly what increases the potential for damage." Workspace isolation limits the blast radius: if an agent makes destructive changes, it only affects its isolated workspace, not the broader codebase.
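As a rough sketch, one worktree per agent can be created with `git worktree add`. The branch naming scheme (`agent/<name>`), the sibling-directory layout, and the example repository path are assumptions made here, not conventions taken from Tessl or Agent Interviews.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` call for one agent (naming scheme is an assumption)."""
    branch = f"agent/{agent}"
    path = repo.parent / f"{repo.name}-{agent}"  # sibling directory next to the repo
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, agent: str, base: str = "main") -> None:
    # check=True raises CalledProcessError if git rejects the request,
    # for example when the branch already exists.
    subprocess.run(worktree_command(repo, agent, base), check=True)


# Only builds and prints the command; call create_worktree() inside a real repository.
cmd = worktree_command(Path("/srv/repos/shop"), "engineer")
print(" ".join(cmd))
```

Each agent then works in its own directory on its own branch, and a discarded experiment can be cleaned up with `git worktree remove` without touching anyone else's checkout.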

Specialized agents produce code faster than any human team. But speed without quality control is a net negative.

Quality Gates: Three-Tier Safeguards

CodeScene puts it bluntly: "Agent speed amplifies both excellent and poor design decisions." Without automated controls at every pipeline step, every advantage from agents becomes a risk amplifier.

The solution is a three-tier safeguard architecture:

Real-time review during code generation: immediate feedback before code reaches the staging area
Pre-commit checks on staged files: style, security, and coverage checks before the commit
Pre-flight branch analysis before the pull request: comprehensive cross-branch checks including cross-file impact analysis

OpenObserve's Sentinel agent is the most aggressive variant: a dedicated agent that "blocks on critical issues, no exceptions." When the Sentinel identifies critical violations (missing test coverage, security anti-patterns, architecture violations), the pipeline stops.

Augment Code documents the CI/CD integration: quality gates run as jobs, aborting immediately on critical issues, while everything else goes to human review. Required status checks prevent merges until the automated analysis passes.
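The "abort on critical, send the rest to review" behavior can be sketched as a small gate function. The severity levels, the `Finding` type, and the rule names are hypothetical; this does not reflect Augment Code's or OpenObserve's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: Severity


class GateBlocked(Exception):
    """Raised when at least one critical finding stops the pipeline."""


def quality_gate(findings: list[Finding]) -> list[Finding]:
    """Abort on critical findings; return everything else for human review."""
    critical = [f for f in findings if f.severity is Severity.CRITICAL]
    if critical:
        raise GateBlocked(", ".join(f.rule for f in critical))
    return list(findings)


# Example run with made-up findings.
for_review = quality_gate([Finding("naming-convention", Severity.WARNING)])
print([f.rule for f in for_review])
```

In a CI job, an uncaught `GateBlocked` would end the process with a non-zero exit code, which is what a required status check needs in order to block the merge.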

An often overlooked point: quality gates are only as good as the underlying code quality. According to CodeScene, agents perform best in healthy code, with a recommended Code Health score of 9.5 or higher on CodeScene's 10-point scale. This means refactoring must happen before agents are deployed, not after. And even with automated gates, human judgment remains necessary: Amazon's evaluation framework shows that inter-agent communication, specialization quality, and logical consistency are dimensions that are difficult to quantify through automated metrics alone.

Whether these quality gates translate into perceived or actual productivity is a separate question entirely. Research shows a systematic gap between how productive developers feel and what the data says, which makes objective measurement at every gate even more critical.

Quality gates catch errors. But how do you prevent agents from working in the wrong direction in the first place?

Spec-Driven Development: Specifications Over Ad-Hoc Prompts

McKinsey/QuantumBlack describes the core problem precisely: "Different developers prompting the same model get different results. Quality depends on individual skill, not on systematic process." Worse still: "Decisions live in chat windows." When auditors or new team members ask about the reasoning behind architectural decisions, the logic is often lost.

Spec-driven development (SDD) replaces ad-hoc prompting with formal specifications. GitHub's Spec Kit defines four phases:

Specify: describe the what and why, focused on user journeys
Plan: provide tech stack and architectural constraints
Tasks: the AI breaks specs into small, reviewable work units
Implement: agents execute tasks sequentially or in parallel

The core idea: language models are strong at pattern recognition but need unambiguous instructions. Vague prompts force the AI to guess. Clear specifications separate the stable "what" from the flexible "how." McKinsey/QuantumBlack identifies the underlying architectural principle: "Successful agent-based implementations follow a specific pattern: deterministic orchestration for workflow control, combined with bounded agent execution and automated evaluation at every step." The orchestration layer is deterministic, even though agent execution within each task remains non-deterministic.
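One way to keep the orchestration layer deterministic is to derive the execution order from an explicit task graph rather than from model output. The sketch below uses Python's standard `graphlib` module; the task names and dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the tasks it depends on.
tasks = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "e2e-tests": {"api", "ui"},
}


def plan_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within one batch can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a reproducible order
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = plan_batches(tasks)
print(batches)
```

The same spec always yields the same plan, while each individual task can still be handed to a non-deterministic agent and evaluated before the next batch starts.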

The Spec Kit works with GitHub Copilot, Claude Code, and Gemini CLI. SDD is tool-agnostic, not tied to any single platform.

Complementing this, AGENTS.md files have become established practice. GitHub's analysis of over 2,500 repositories shows that these files give agents project-specific context (build commands, test conventions, code style, boundaries). The recommendation: 150 lines maximum, example-driven rather than prose-heavy. The quality of these files correlates directly with the quality of the agent output.
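A length check for AGENTS.md is easy to automate. The following is a hypothetical lint sketch: the 150-line limit comes from the text, while treating "contains at least one fenced example" as a proxy for "example-driven" is a heuristic assumed here.

```python
MAX_LINES = 150   # limit quoted in the text
FENCE = "`" * 3   # Markdown code fence marker


def check_agents_md(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines, limit is {max_lines}")
    # Heuristic (assumption): at least one fenced block signals "example-driven".
    if FENCE not in text:
        problems.append("no fenced example found")
    return problems


sample = f"# Build\n{FENCE}\nnpm test\n{FENCE}\n"
print(check_agents_md(sample))
```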

Specifications and quality gates form the technical foundation. But the biggest risks of agent-based workflows are not purely technical.

Risks, Security, and the Role of Humans

Indirect prompt injection is considered the most critical vulnerability of agent-based systems. The attack surface extends to every document an agent reads and every tool it uses. Related incidents show how little written guardrails guarantee once an agent has real permissions: Replit's AI assistant deleted a production database in July 2025 despite explicit instructions that were supposed to prevent exactly that, and OpenAI's Operator made an unauthorized purchase on Instacart. Natural-language instructions alone are not a sufficient security control.

The answer is Human-in-the-Loop (HITL), the targeted integration of human oversight at critical decision points. Permit.io documents that well-designed HITL systems "escalate fewer than 10% of decisions to humans," while high-confidence paths continue autonomously. The goal is not to slow agents down, but to require human judgment only where it is truly needed: at PR merges, production deployments, security-critical changes, and architectural decisions.
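A minimal escalation rule could look like the sketch below. The action names mirror the decision points listed above; the `Action` type, the self-reported confidence value, and the 0.9 threshold are assumptions for illustration and are not taken from Permit.io.

```python
from dataclasses import dataclass

# Action types the text lists as needing human judgment.
ALWAYS_ESCALATE = {"pr_merge", "production_deploy", "security_change", "architecture_decision"}


@dataclass(frozen=True)
class Action:
    kind: str
    confidence: float  # agent's self-reported confidence in [0, 1] (assumption)


def needs_human(action: Action, threshold: float = 0.9) -> bool:
    """Escalate listed action types and anything below the confidence threshold."""
    return action.kind in ALWAYS_ESCALATE or action.confidence < threshold


print(needs_human(Action("pr_merge", 0.99)))
print(needs_human(Action("format_code", 0.95)))
```

Checking the action type first means a confident agent still cannot merge or deploy on its own; confidence only decides the routing for everything else.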

Note: The EU AI Act's high-risk rules take effect on August 2, 2026. Article 14 legally mandates human oversight for high-risk AI systems. For organizations in regulated industries, HITL becomes not just a best practice but a compliance requirement.

Since its 2025 update for agentic AI, the NIST AI Risk Management Framework, which is voluntary guidance rather than regulation, has called for the following: mapping of all agent tool access permissions, circuit breakers for budget overruns or unauthorized API calls, and continuous monitoring for behavioral drift.
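A circuit breaker of this kind can be sketched as a thin wrapper around every tool call. The class below is a hypothetical illustration, not a NIST-specified design: the tool allowlist, the dollar-based budget, and all names are assumptions.

```python
class CircuitOpen(Exception):
    """Raised once the breaker has tripped; the agent must stop."""


class AgentCircuitBreaker:
    def __init__(self, budget_usd: float, allowed_tools: set[str]):
        self.budget_usd = budget_usd
        self.allowed_tools = allowed_tools
        self.spent_usd = 0.0
        self.tripped = False

    def record_call(self, tool: str, cost_usd: float) -> None:
        """Check one tool call; trip on an unlisted tool or a budget overrun."""
        if self.tripped:
            raise CircuitOpen("breaker already tripped")
        if tool not in self.allowed_tools:
            self.tripped = True
            raise CircuitOpen(f"tool not permitted: {tool}")
        if self.spent_usd + cost_usd > self.budget_usd:
            self.tripped = True
            raise CircuitOpen("budget exceeded")
        self.spent_usd += cost_usd


breaker = AgentCircuitBreaker(budget_usd=1.0, allowed_tools={"read_file", "run_tests"})
breaker.record_call("read_file", 0.4)
print(breaker.spent_usd)
```

Once tripped, the breaker stays open until a human resets it, which keeps a misbehaving agent from retrying its way past the limit.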

Anthropic positions the adoption of agent-based workflows as a strategic differentiator: "Teams that treat agentic coding as a strategic priority (rather than tactical tooling) will achieve disproportionate competitive advantages." Yet according to Anthropic's own data, developers can currently fully delegate only 0 to 20% of their tasks. The technology is early, but the direction is clear.

The action framework for 2026 can be distilled into four points:

Specifications before prompts: formal specs and AGENTS.md files deliver reproducible results
Quality gates at every pipeline stage: three-tiered, automated, with the Sentinel as a hard blocker
Workspace isolation as standard: Git worktrees or containers for every agent
Human checkpoints at irreversible decisions: PR merges, deployments, architectural decisions

The common thread: treat agent-based workflows as a systems engineering challenge, not a prompt engineering exercise. Invest in infrastructure, not in better phrasing. And heed McKinsey: the unit of change is the workflow, not the tool.

When building these pipelines, provider portability becomes a practical concern. Multi-model graphs and fallback chains require architectures that work across LLM providers, a challenge explored in depth in "Provider-Agnostic Agents: Why Adapters Alone Aren't Enough."

Rakuten llevó a cabo a principios de 2026 una compleja implementación de vLLM a través de una base de código de 12,5 millones de líneas. De forma autónoma, en siete horas, con 99,9% de precisión numérica. Ningún desarrollador individual escribió el código. Un equipo orquestado de agentes de IA asumió la tarea; el humano especificó y posteriormente revisó el resultado.

Esto no es un experimento de laboratorio. También TELUS generó más de 13.000 soluciones con IA y entregó código de ingeniería un 30% más rápido, con más de 500.000 horas ahorradas en total. Es la dirección en la que se mueve el desarrollo profesional de software: del escribir líneas de código individuales a la orquestación de agentes especializados que implementan funcionalidades, escriben tests y entregan Pull Requests.

De director a orquestador

O'Reilly Radar describe el cambio actual como la siguiente etapa en la historia de abstracción del desarrollo de software: Assembly, lenguajes de alto nivel, frameworks, AI Code Completion, y ahora Agentic Orchestration. Cada etapa aleja al desarrollador de los detalles de bajo nivel y lo acerca más a la intención arquitectónica.

En el modelo anterior de "Conductor", un desarrollador guía a un único asistente de IA paso a paso a través de una tarea. En el modelo de "Orchestrator", el desarrollador describe una tarea, múltiples agentes especializados trabajan de forma asíncrona, y el resultado es un Pull Request terminado para la revisión humana.

La consecuencia para el día a día laboral: el esfuerzo del desarrollador se desplaza. Según O'Reilly, migra "hacia adelante, a la especificación precisa de tareas, y hacia atrás, al Code Review". La pregunta central cambia de "¿Cómo programo esto?" a "¿Cómo descompongo esta tarea para que los agentes puedan implementarla de forma autónoma?"

El Coding Agent de GitHub Copilot muestra este patrón de forma concreta: "evalúa las tareas asignadas, realiza los cambios necesarios y abre Pull Requests". Los desarrolladores siguen trabajando mientras el agente se ejecuta en segundo plano y controlan el resultado a través de PR reviews. Los casos de uso recomendados: correcciones de bugs rutinarias, tareas en segundo plano, generación de tests, actualizaciones de dependencias y prototipado.

Bitmovin documents a fully automated pipeline from Jira ticket to pull request, but warns clearly: "Multi-agent programming feels like magic at first, but without proper orchestration it quickly turns into a chaos of broken builds and merge conflicts."

The DEV Community sums up the philosophical difference between the tool categories concisely: Copilot and Cursor are assistive, amplifying what you are doing at the moment. Claude Code is agentic: you describe a goal and it executes a plan. It does not suggest, it acts: it opens files, writes code, updates configurations, and runs builds.

McKinsey backs the view that this is more than a change of tooling: "Agentic AI initiatives that redesign entire workflows from scratch deliver better results than those that simply layer AI on top of existing processes."

When developers become orchestrators, a practical question immediately arises: what does the orchestra actually look like?

Multi-Agent Pipelines: Specialization Instead of Universal Agents

The answer lies in specialized multi-agent pipelines. The analogy is closer than expected: a modern production line does not run on one universal robot, but on specialized stations (welding, painting, quality control). Each station does one thing well. In the same way, production agent systems break the development process into phases, each staffed by a focused agent with clear responsibilities.

OpenObserve has the most detailed publicly documented pipeline: the "Council of Sub Agents". Six specialized AI agents work in a six-phase pipeline:

Phase	Agent	Task
1	Analyst	Extracts features and edge cases from the requirements
2	Architect	Creates prioritized test plans (P0/P1/P2)
3	Engineer	Generates Playwright test code using Page Object Models
4	Sentinel	Quality gate: blocks on critical issues
5	Healer	Runs tests, diagnoses failures, iterates up to 5 times
6	Scribe	Documents everything

The central architectural principle is context chaining: each agent receives the enriched output of its predecessors. The Engineer does not start from raw requirements, but from the Architect's structured test plan. The Healer does not operate blind, but with the Analyst's inventory of edge cases.
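Context chaining can be sketched in a few lines. The following is a minimal illustration, not OpenObserve's implementation: the `Context` class, the `run_chain` helper, and the three stub agents are hypothetical, and each stub stands in for what would be an LLM call in a real pipeline.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Context:
    """Accumulated pipeline state handed from one agent to the next."""
    requirements: str
    artifacts: dict[str, Any] = field(default_factory=dict)


# Hypothetical agent signature: read the shared context, return one artifact.
Agent = Callable[[Context], Any]


def analyst(ctx: Context) -> Any:
    # Stub for an LLM call: derive features and edge cases from raw requirements.
    return {"features": [ctx.requirements], "edge_cases": ["empty input"]}


def architect(ctx: Context) -> Any:
    # Works from the Analyst's output, not from the raw requirements.
    analysis = ctx.artifacts["analyst"]
    return [{"priority": "P0", "feature": f} for f in analysis["features"]]


def engineer(ctx: Context) -> Any:
    # Works from the Architect's plan; earlier artifacts stay available too.
    plan = ctx.artifacts["architect"]
    return [f"test_{i}_{item['priority'].lower()}" for i, item in enumerate(plan)]


def run_chain(requirements: str, stages: list[tuple[str, Agent]]) -> Context:
    """Run the stages in order, storing each output under the stage name."""
    ctx = Context(requirements)
    for name, agent in stages:
        ctx.artifacts[name] = agent(ctx)
    return ctx
```

Because every artifact stays in the context, a later stage such as a Healer could read the Analyst's edge cases directly instead of re-deriving them.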

The central takeaway across all sources: specialization beats generalization. Bounded agents with clear responsibilities outperform "super agents" that try to do everything at once.

OpenObserve's reported results back this up. The test suite grew from 380 to more than 700 tests (84% more), while flaky tests fell by 85%, from more than 30 to 4 or 5. A production bug in the ServiceNow integration was caught before customers reported it. Feature analysis dropped from 45-60 minutes to 5-10 minutes, and human review time from hours to minutes.

Academic research confirms the pattern. A literature review of LLM-based multi-agent systems documents how development processes are organized into distinct phases, each handled by specialized agents with domain knowledge.

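# Pseudocode: multi-agent pipeline configuration (file names are illustrative)
pipeline:
  stages:
    - agent: analyst          # extracts features and edge cases
      input: requirements.md
      output: feature-analysis.json
    - agent: architect        # turns the analysis into a prioritized test plan
      input: feature-analysis.json
      output: test-plan.yaml
    - agent: engineer         # generates the test code
      input: test-plan.yaml
      output: test-code/
    - agent: sentinel         # quality gate; stops the pipeline on critical findings
      input: test-code/
      gate: block_on_critical
      output: sentinel-report.json
    - agent: healer           # runs and repairs tests, bounded to 5 iterations
      input:
        - test-code/
        - sentinel-report.json
      max_iterations: 5
      output: verified-tests/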

Anthropic's Agent Teams offer this pattern as a platform feature: one session acts as the team lead and distributes tasks, while the teammates work in their own context windows. The recommendation: 3 to 5 teammates for most workflows.

Workspace Isolation: Every Agent in Its Own Sandbox

When multiple agents work in parallel on the same code, each needs an isolated environment. Tessl describes three scalable strategies: multiple IDE windows (minimal isolation), Git worktrees (separate working directories that share one Git history), and container isolation (each agent gets its own machine). In practice, Git worktrees have become the standard.

Agent Interviews documents the approach: "Each worktree is completely isolated; files created in one do not appear in any other." This lets multiple agents work simultaneously on isolated copies of the codebase, each on its own branch. Background agents such as Copilot CLI use worktrees specifically to avoid conflicts with active work.
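A minimal sketch of how one worktree per agent might be provisioned. The helper names, the `agent/<name>` branch scheme, and the sibling-directory layout are assumptions made for illustration; only the `git worktree add -b <branch> <path> <base>` invocation is standard Git. The function defaults to a dry run that returns the commands without executing them.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` call that gives one agent its own checkout."""
    # Hypothetical layout: worktrees live next to the repo, e.g. /srv/app-analyst.
    path = repo.parent / f"{repo.name}-{agent}"
    # Hypothetical naming scheme: one new branch per agent.
    branch = f"agent/{agent}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktrees(
    repo: Path, agents: list[str], base: str = "main", dry_run: bool = True
) -> list[list[str]]:
    """Return the commands for all agents; execute them only when dry_run is False."""
    commands = [worktree_command(repo, agent, base) for agent in agents]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if Git rejects the command.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `create_worktrees(Path("/srv/app"), ["analyst", "engineer"])` only builds the two commands; passing `dry_run=False` would run them against a real repository.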

The security aspect is central. Tessl names the fundamental problem: "Delegating to agents requires expanded access, but that is precisely what increases the potential damage." Workspace isolation limits the blast radius: if an agent makes destructive changes, only its isolated workspace is affected, not the codebase as a whole.

Specialized agents produce code faster than any human team. But speed without quality control is a net negative.

Quality Gates: Protection at Three Levels

CodeScene puts it clearly: "The speed of agents amplifies both excellent and bad design decisions." Without automated checks at every step of the pipeline, every advantage of agents becomes a risk amplifier.

The solution is a three-level protection architecture:

Real-time review during code generation: immediate feedback before the code reaches the staging area
Pre-commit checks on the staged files: style, security, and coverage verification before the commit
Pre-flight branch analysis before the pull request: exhaustive cross-branch verification, including cross-file impact analysis

OpenObserve's Sentinel agent is the most aggressive variant: a dedicated agent that "blocks on critical issues, without exception". When the Sentinel identifies critical violations (insufficient test coverage, security anti-patterns, architecture violations), the pipeline stops.

Augment Code documents the CI/CD integration: quality gates run as jobs; critical issues abort the run immediately, and everything else goes to human review. Required status checks prevent merges until the automated analysis has passed.
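The gate logic described here (abort on critical findings, pass everything else to human review) might look roughly like the sketch below. The `Severity`, `Finding`, and `GateResult` types and the rule names are hypothetical; they are not taken from OpenObserve or Augment Code.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Finding:
    rule: str            # hypothetical rule identifier, e.g. "low-coverage"
    severity: Severity


@dataclass(frozen=True)
class GateResult:
    passed: bool
    blocking: list[Finding]     # critical findings that stop the pipeline
    for_review: list[Finding]   # everything else, handed to human reviewers


def evaluate_gate(findings: list[Finding]) -> GateResult:
    """Block on any critical finding; route the rest to human review."""
    blocking = [f for f in findings if f.severity is Severity.CRITICAL]
    for_review = [f for f in findings if f.severity is not Severity.CRITICAL]
    return GateResult(passed=not blocking, blocking=blocking, for_review=for_review)


def exit_code(result: GateResult) -> int:
    """Map the result to a process exit code: non-zero fails the CI job."""
    return 0 if result.passed else 1
```

Run as a CI job, a non-zero exit code fails the job, and a required status check on that job would then keep the pull request from being merged.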

One point that is often overlooked: quality gates only work as well as the quality of the underlying code. According to CodeScene, agents perform best on healthy code, with a recommended Code Health score of 9.5 or higher. This means refactoring has to happen before agents are deployed, not after. And even with automated gates, human judgment remains necessary: Amazon's evaluation framework shows that inter-agent communication, quality of specialization, and logical consistency are dimensions that are hard to quantify with automated metrics alone.

Whether these quality gates translate into perceived or actual productivity is an entirely different question. Research shows a systematic gap between how productive developers feel and what the data says, which makes objective measurement at every gate all the more critical.

Quality gates catch errors. But how do you keep agents from working in the wrong direction from the start?

Spec-Driven Development: Specifications Instead of Ad-Hoc Prompts

McKinsey/QuantumBlack describes the fundamental problem precisely: "Different developers prompting the same model get different results. Quality depends on individual skill, not on a systematic process." Worse still: "Decisions live in chat windows". When auditors or new team members ask for the rationale behind architectural decisions, the reasoning has often been lost.

Spec-Driven Development (SDD) replaces ad-hoc prompting with formal specifications. GitHub's Spec Kit defines four phases:

Specify: describes the what and the why, focusing on user journeys
Plan: provides the tech stack and architectural constraints
Tasks: the AI breaks the specs into small, reviewable units of work
Implement: agents execute the tasks sequentially or in parallel

The key point: language models are strong at pattern recognition, but they need unambiguous instructions. Vague prompts force the AI to guess. Clear specifications separate the stable "what" from the flexible "how". McKinsey/QuantumBlack identifies the underlying architectural principle: "Successful agent-based implementations follow a specific pattern: deterministic orchestration for workflow control, combined with bounded agent execution and automated evaluation at every step." The orchestration layer is deterministic, even though agent execution within each task remains non-deterministic.
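The pattern in that quote can be sketched as a plain loop. This is an illustration under assumptions, not McKinsey's or Spec Kit's code: `run_task`, `run_plan`, and the callable signatures are hypothetical. The agent is any function that takes a task and feedback from the previous attempt; the evaluator returns an empty string when the output is acceptable.

```python
from typing import Callable

# Hypothetical signatures: agent(task, feedback) -> output; evaluate(output) -> feedback.
AgentFn = Callable[[str, str], str]
EvalFn = Callable[[str], str]


def run_task(
    task: str, agent: AgentFn, evaluate: EvalFn, max_attempts: int = 3
) -> tuple[str, int]:
    """Bounded execution: retry until the evaluation passes or attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(task, feedback)   # non-deterministic part (an LLM call in practice)
        feedback = evaluate(output)      # automated evaluation after every attempt
        if not feedback:                 # empty feedback means the output was accepted
            return output, attempt
    raise RuntimeError(f"task {task!r} failed evaluation after {max_attempts} attempts")


def run_plan(
    tasks: list[str], agent: AgentFn, evaluate: EvalFn, max_attempts: int = 3
) -> dict[str, str]:
    """Deterministic orchestration: fixed task order, fixed retry budget per task."""
    return {task: run_task(task, agent, evaluate, max_attempts)[0] for task in tasks}
```

The control flow never depends on model output beyond the pass or fail signal, which is what keeps the orchestration layer reproducible even when individual attempts differ.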

Spec Kit works with GitHub Copilot, Claude Code, and Gemini CLI. SDD is tool-agnostic and not tied to a platform.

As a complement, AGENTS.md files have become established. GitHub's analysis of more than 2,500 repositories shows that these files give agents project-specific context (build commands, test conventions, code style, boundaries). The recommendation: at most 150 lines, driven by examples rather than prose. The quality of these files correlates directly with the quality of the agent's output.
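A small lint along these lines could enforce the length recommendation in CI. This is a rough heuristic sketch: the `check_agents_md` function and its keyword-based topic check are assumptions for illustration, not a GitHub tool, and matching on keywords is only a crude proxy for the four kinds of context listed above.

```python
# Hypothetical keyword per topic: build commands, test conventions, code style, boundaries.
REQUIRED_TOPICS = ("build", "test", "style", "boundaries")


def check_agents_md(text: str, max_lines: int = 150) -> list[str]:
    """Return a list of problems; an empty list means the file passes this sketch."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"{line_count} lines exceeds the limit of {max_lines}")
    lowered = text.lower()
    for topic in REQUIRED_TOPICS:
        # Crude check: the topic word has to appear somewhere in the file.
        if topic not in lowered:
            problems.append(f"missing topic: {topic}")
    return problems
```

In practice the text would come from something like `Path("AGENTS.md").read_text()`, with a non-empty result failing the check.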

Specifications and quality gates form the technical foundation. The biggest risks of agent-based workflows, however, are not purely technical.

Risks, Security, and the Role of the Human

Indirect prompt injection is considered the most critical vulnerability of agent-based systems. The attack surface extends to every document an agent reads and every tool it uses. The consequences are real: in July 2025, Replit's AI assistant deleted a production database despite explicit instructions meant to prevent it. OpenAI's Operator made an unauthorized purchase on Instacart. Natural-language instructions alone are not a sufficient security control.

The answer is Human-in-the-Loop (HITL), the deliberate integration of human oversight at critical decision points. Permit.io documents that well-designed HITL systems "escalate fewer than 10% of decisions to humans", while high-confidence paths continue autonomously. The goal is not to slow agents down, but to require human judgment only where it is really needed: PR merges, production deployments, security-critical changes, and architecture decisions.
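A routing rule of this kind can be sketched as follows. The action names, the `confidence` field, and the 0.9 threshold are assumptions made for illustration, not values from Permit.io; the escalation rate helper shows how a team might track its share of escalated decisions against a target.

```python
from dataclasses import dataclass

# Hypothetical action kinds that always require a human, mirroring the list above.
IRREVERSIBLE = {"merge_pr", "deploy_production", "security_change", "architecture_decision"}


@dataclass(frozen=True)
class Action:
    kind: str
    confidence: float  # assumed score in [0, 1] from the agent or an evaluator


def needs_human(action: Action, threshold: float = 0.9) -> bool:
    """Escalate irreversible actions always, and any action below the threshold."""
    return action.kind in IRREVERSIBLE or action.confidence < threshold


def escalation_rate(actions: list[Action], threshold: float = 0.9) -> float:
    """Share of actions that would be routed to a human reviewer."""
    if not actions:
        return 0.0
    return sum(needs_human(a, threshold) for a in actions) / len(actions)
```

Note that a high confidence score never bypasses the check for irreversible actions: the action type is a hard rule, the threshold only governs everything else.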

Note: The high-risk rules of the EU AI Act take effect on August 2, 2026. Article 14 makes human oversight a legal requirement for high-risk AI systems. For organizations in regulated sectors, HITL is no longer just a best practice but a compliance requirement.

Since its 2025 update for agentic AI, the NIST AI Risk Management Framework calls for a mapping of all agent tool-access permissions, circuit breakers for budget overruns or unauthorized API calls, and continuous monitoring of behavioral drift.
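As an illustration of the circuit-breaker idea, the sketch below trips on a budget overrun or on a call to a tool outside an allowlist, and stays open afterwards. The class, its parameters, and the tool names are hypothetical; this is not an API defined by NIST or by any agent framework.

```python
from typing import Optional


class CircuitOpen(Exception):
    """Raised when the breaker has tripped and the agent must stop."""


class AgentCircuitBreaker:
    def __init__(self, budget_usd: float, allowed_tools: set[str]):
        self.budget_usd = budget_usd
        self.allowed_tools = frozenset(allowed_tools)  # the mapped tool permissions
        self.spent_usd = 0.0
        self.tripped_reason: Optional[str] = None

    def authorize(self, tool: str, cost_usd: float) -> None:
        """Call before every tool invocation; raises CircuitOpen to halt the agent."""
        if self.tripped_reason is not None:
            # Once tripped, the breaker stays open until a human resets it.
            raise CircuitOpen(self.tripped_reason)
        if tool not in self.allowed_tools:
            self._trip(f"unauthorized tool: {tool}")
        if self.spent_usd + cost_usd > self.budget_usd:
            self._trip("budget exceeded")
        self.spent_usd += cost_usd

    def _trip(self, reason: str) -> None:
        self.tripped_reason = reason
        raise CircuitOpen(reason)
```

The breaker is enforced in code outside the model, so it holds even when natural-language instructions are ignored or overridden by injected content.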

Anthropic positions the adoption of agent-based workflows as a strategic differentiator: "Teams that treat agentic coding as a strategic priority (rather than a tactical tool) will gain disproportionate competitive advantages." Yet according to Anthropic's own data, developers can currently fully delegate only 0 to 20% of their tasks. The technology is in its early days, but the direction is clear.

The action framework for 2026 boils down to four points:

Specifications before prompts: formal specs and AGENTS.md files deliver reproducible results
Quality gates at every pipeline stage: three levels, automated, with the Sentinel as a strict blocker
Workspace isolation as the default: Git worktrees or containers for every agent
Human checkpoints on irreversible decisions: PR merges, deployments, architecture decisions

The common denominator: treat agent-based workflows as a systems engineering challenge, not as a prompt engineering exercise. Invest in infrastructure, not in better wording. And listen to McKinsey: the unit of change is the workflow, not the tool.

When building these pipelines, provider portability becomes a practical question. Multi-model graphs and fallback chains require architectures that work across LLM providers, a challenge explored in depth in Provider-Agnostic Agents: Why Adapters Alone Aren't Enough.
