Rakuten Needed Seven Hours for 12.5 Million Lines. How Agent-Based Development Workflows Work.

Dr. Florian Drechsler · March 4, 2026 · 10 min read

Tags: AI Agents, Software Engineering, Architecture, AI Development

In early 2026, Rakuten completed a complex vLLM implementation across a codebase of 12.5 million lines. Autonomously, in seven hours, with 99.9% numerical accuracy. No single developer wrote the code. An orchestrated team of AI agents took over the task, while humans specified the requirements and reviewed the results.

This is not a lab experiment. TELUS has also generated over 13,000 AI solutions and shipped engineering code 30% faster, saving over 500,000 hours in total. This is the direction professional software development is heading: away from writing individual lines of code, toward orchestrating specialized agents that implement features, write tests, and deliver pull requests.

From Conductor to Orchestrator

O'Reilly Radar describes the current shift as the next stage in the abstraction history of software development: Assembly, high-level languages, frameworks, AI code completion, and now agentic orchestration. Each stage moves the developer further from low-level details and closer to architectural intent.

In the previous "conductor" model, a developer guides a single AI assistant step by step through a task. In the "orchestrator" model, the developer describes a task, multiple specialized agents work asynchronously, and the result is a finished pull request for human review.

The consequence for day-to-day work: developer effort shifts. According to O'Reilly, it moves "forward into precise task specification and backward into code review." The core question changes from "How do I code this?" to "How do I break this task down so that agents can execute it autonomously?"

GitHub Copilot's Coding Agent demonstrates this pattern concretely: it "evaluates assigned tasks, makes the necessary changes, and opens pull requests." Developers continue working while the agent runs in the background, steering the outcome through PR reviews. Recommended use cases: routine bug fixes, background tasks, test generation, dependency updates, and prototyping.

Bitmovin documents the fully automated pipeline from Jira ticket to pull request, but warns clearly: "Multi-agent coding feels like magic at first, but without proper orchestration it quickly becomes a chaotic mess of broken builds and merge conflicts."

The DEV Community sums up the philosophical difference between tool categories concisely: Copilot and Cursor are assistive; they amplify what you're currently doing. Claude Code is agentic; you describe a goal and it executes a plan. It doesn't suggest, it acts: it opens files, writes code, updates configurations, and runs builds.

McKinsey underscores that this shift is more than a tool change: "Agentic AI initiatives that fundamentally rethink entire workflows deliver better outcomes than those that simply bolt AI onto existing processes."

When developers become orchestrators, an immediate practical question arises: what does the orchestra actually look like?

Multi-Agent Pipelines: Specialization Over Universal Agents

The answer lies in specialized multi-agent pipelines. The analogy is closer than expected: a modern assembly line doesn't work with a universal robot, but with specialized stations (welding, painting, quality control). Each station does one thing well. In the same way, production-grade agent systems break the development process into phases, each staffed with a focused agent that has clear responsibilities.

OpenObserve has the most detailed publicly documented pipeline: the "Council of Sub Agents." Six specialized AI agents work in a 6-phase pipeline:

| Phase | Agent | Task |
|-------|-----------|------|
| 1 | Analyst | Extracts features and edge cases from requirements |
| 2 | Architect | Creates prioritized test plans (P0/P1/P2) |
| 3 | Engineer | Generates Playwright test code with page object models |
| 4 | Sentinel | Quality gate: blocks on critical issues |
| 5 | Healer | Runs tests, diagnoses failures, iterates up to 5x |
| 6 | Scribe | Documents everything |

The core architectural principle is context chaining: each agent receives the enriched output of its predecessors. The Engineer doesn't start with raw requirements but with the Architect's structured test plan. The Healer doesn't operate blindly but with the Analyst's edge case inventory.
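Context chaining can be sketched in a few lines: each agent receives and enriches a shared context object rather than starting from raw requirements. This is a minimal illustration in Python; the `Context` type and the agent functions are hypothetical stand-ins for LLM-backed workers, not OpenObserve's actual implementation.

```python
# Sketch of context chaining: each agent builds on its predecessors'
# enriched output (names and artifact shapes are illustrative).
from dataclasses import dataclass, field

@dataclass
class Context:
    """Accumulated artifacts passed from agent to agent."""
    requirements: str
    artifacts: dict = field(default_factory=dict)

def analyst(ctx: Context) -> Context:
    # A real agent would call an LLM; here we just record the artifact.
    ctx.artifacts["edge_cases"] = f"edge cases for: {ctx.requirements}"
    return ctx

def architect(ctx: Context) -> Context:
    # Builds on the Analyst's output, not on raw requirements.
    ctx.artifacts["test_plan"] = f"P0/P1/P2 plan from {ctx.artifacts['edge_cases']}"
    return ctx

def engineer(ctx: Context) -> Context:
    # Starts from the Architect's structured plan.
    ctx.artifacts["test_code"] = f"tests implementing {ctx.artifacts['test_plan']}"
    return ctx

ctx = Context(requirements="login flow")
for agent in (analyst, architect, engineer):
    ctx = agent(ctx)

print(sorted(ctx.artifacts))  # each agent saw its predecessors' output
```

The point of the pattern is visible in the final artifact: the Engineer's output transitively contains the Analyst's edge-case analysis, because every stage consumed the enriched context rather than the original prompt.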

The central insight across all sources: specialization beats generalization. Bounded agents with clear responsibilities outperform "super agents" that try to do everything at once.

The results speak for themselves. According to OpenObserve, test coverage grew from 380 to over 700 tests (up 84%), while flaky tests dropped by 85%, from over 30 down to 4 or 5. A production bug in the ServiceNow integration was caught before customers reported it. Feature analysis dropped from 45-60 minutes to 5-10 minutes, and human review time went from hours to minutes.

Academic research confirms the pattern. A literature survey on LLM-based multi-agent systems documents how development processes are organized into distinct phases, each managed by specialized agents with domain knowledge.

# Pseudocode: Multi-Agent Pipeline Configuration
pipeline:
  stages:
    - agent: analyst
      input: requirements.md
      output: feature-analysis.json
    - agent: architect
      input: feature-analysis.json
      output: test-plan.yaml
    - agent: engineer
      input: test-plan.yaml
      output: test-code/
    - agent: sentinel
      input: test-code/
      gate: block_on_critical
    - agent: healer
      input: test-code/ + sentinel-report.json
      max_iterations: 5
      output: verified-tests/

Anthropic's Agent Teams bring this pattern as a platform feature: one session acts as team leader, distributing tasks, while team members work in their own context windows. The recommendation: 3 to 5 team members for most workflows.

Workspace Isolation: Each Agent in Its Own Sandbox

When multiple agents work on the same code in parallel, each one needs an isolated environment. Tessl describes three escalating strategies: multiple IDE windows (minimal isolation), Git worktrees (separate file systems with shared Git history), and container isolation (each agent gets its own machine). In practice, Git worktrees have become the standard.

Agent Interviews documents the approach: "Each worktree is completely isolated, files created in one don't appear in any other." This makes it possible to have multiple agents working simultaneously on isolated copies of the codebase, each on its own branch. Background agents like Copilot CLI specifically use worktrees to avoid conflicts with active work.

The security aspect is central. Tessl names the fundamental problem: "Delegation to agents requires expanded access, but that's exactly what increases the potential for damage." Workspace isolation limits the blast radius: if an agent makes destructive changes, it only affects its isolated workspace, not the broader codebase.

Specialized agents produce code faster than any human team. But speed without quality control is a net negative.

Quality Gates: Three-Tier Safeguards

CodeScene puts it bluntly: "Agent speed amplifies both excellent and poor design decisions." Without automated controls at every pipeline step, every advantage from agents becomes a risk amplifier.

The solution is a three-tier safeguard architecture:

  1. Real-time review during code generation: immediate feedback before code reaches the staging area
  2. Pre-commit checks on staged files: style, security, and coverage checks before the commit
  3. Pre-flight branch analysis before the pull request: comprehensive cross-branch checks including cross-file impact analysis

OpenObserve's Sentinel agent is the most aggressive variant: a dedicated agent that "blocks on critical issues, no exceptions." When the Sentinel identifies critical violations (missing test coverage, security anti-patterns, architecture violations), the pipeline stops.

Augment Code documents the CI/CD integration: quality gates run as jobs, aborting immediately on critical issues, while everything else goes to human review. Required status checks prevent merges until the automated analysis passes.

An often overlooked point: quality gates are only as good as the underlying code quality. According to CodeScene, agents perform best in healthy code, with a recommended Code Health Score of 9.5+. This means refactoring must happen before agent deployment, not after. And even with automated gates, human judgment remains necessary: Amazon's evaluations framework shows that inter-agent communication, specialization quality, and logical consistency are dimensions that are difficult to quantify through automated metrics alone.

Whether these quality gates translate into perceived or actual productivity is a separate question entirely. Research shows a systematic gap between how productive developers feel and what the data says, which makes objective measurement at every gate even more critical.

Quality gates catch errors. But how do you prevent agents from working in the wrong direction in the first place?

Spec-Driven Development: Specifications Over Ad-Hoc Prompts

McKinsey/QuantumBlack describes the core problem precisely: "Different developers prompting the same model get different results. Quality depends on individual skill, not on systematic process." Worse still: "Decisions live in chat windows." When auditors or new team members ask about the reasoning behind architectural decisions, the logic is often lost.

Spec-driven development (SDD) replaces ad-hoc prompting with formal specifications. GitHub's Spec Kit defines four phases:

  1. Specify: describe the what and why, focused on user journeys
  2. Plan: provide tech stack and architectural constraints
  3. Tasks: the AI breaks specs into small, reviewable work units
  4. Implement: agents execute tasks sequentially or in parallel

The core idea: language models are strong at pattern recognition but need unambiguous instructions. Vague prompts force the AI to guess. Clear specifications separate the stable "what" from the flexible "how." McKinsey/QuantumBlack identifies the underlying architectural principle: "Successful agent-based implementations follow a specific pattern: deterministic orchestration for workflow control, combined with bounded agent execution and automated evaluation at every step." The orchestration layer is deterministic, even though agent execution within each task remains non-deterministic.

The Spec Kit works with GitHub Copilot, Claude Code, and Gemini CLI. SDD is tool-agnostic, not tied to any single platform.

Complementing this, AGENTS.md files have become established practice. GitHub's analysis of over 2,500 repositories shows that these files give agents project-specific context (build commands, test conventions, code style, boundaries). The recommendation: 150 lines maximum, example-driven rather than prose-heavy. The quality of these files correlates directly with the quality of the agent output.

Specifications and quality gates form the technical foundation. But the biggest risks of agent-based workflows are not purely technical.

Risks, Security, and the Role of Humans

Indirect prompt injection is considered the most critical vulnerability of agent-based systems. The attack vector extends to every document an agent reads and every tool it uses. The consequences are real: Replit's AI assistant deleted a production database in July 2025 despite explicit instructions that were supposed to prevent exactly that. OpenAI's Operator made an unauthorized purchase on Instacart. Natural language instructions alone are not a sufficient security control.

The answer is Human-in-the-Loop (HITL), the targeted integration of human oversight at critical decision points. Permit.io documents that well-designed HITL systems "escalate fewer than 10% of decisions to humans," while high-confidence paths continue autonomously. The goal is not to slow agents down, but to require human judgment only where it is truly needed: at PR merges, production deployments, security-critical changes, and architectural decisions.

Note: The EU AI Act's high-risk rules take effect on August 2, 2026. Article 14 legally mandates human oversight for high-risk AI systems. For organizations in regulated industries, HITL becomes not just a best practice but a compliance requirement.

The NIST AI Risk Management Framework has required the following since its 2025 update for agentic AI: mapping of all agent tool access permissions, circuit breakers for budget overruns or unauthorized API calls, and continuous monitoring for behavioral drift.

Anthropic positions the adoption of agent-based workflows as a strategic differentiator: "Teams that treat agentic coding as a strategic priority (rather than tactical tooling) will achieve disproportionate competitive advantages." Yet according to Anthropic's own data, developers can currently only fully delegate 0 to 20% of their tasks. The technology is early, but the direction is clear.

The action framework for 2026 can be distilled into four points:

  • Specifications before prompts: formal specs and AGENTS.md files deliver reproducible results
  • Quality gates at every pipeline stage: three-tiered, automated, with the Sentinel as a hard blocker
  • Workspace isolation as standard: Git worktrees or containers for every agent
  • Human checkpoints at irreversible decisions: PR merges, deployments, architectural decisions

The common thread: treat agent-based workflows as a systems engineering challenge, not a prompt engineering exercise. Invest in infrastructure, not in better phrasing. And heed McKinsey: the unit of change is the workflow, not the tool.

When building these pipelines, provider portability becomes a practical concern. Multi-model graphs and fallback chains require architectures that work across LLM providers, a challenge explored in depth in Provider-Agnostic Agents: Why Adapters Alone Aren't Enough.
