I want to share something I've been working on for the past several months. It's not perfect, I'm still iterating on it daily, but it's changed how I approach software development fundamentally. And I think the lessons might be useful to others exploring this space.
This is my attempt at building a multi-agent orchestration system for Claude Code. Think of it as trying to create a virtual engineering team where different AI agents handle different responsibilities: architects who design, implementers who code, reviewers who validate. It sounds ambitious because it is. I've made plenty of mistakes along the way, and I'm certain there are better approaches I haven't discovered yet.
But first, let me tell you why I even started down this path.
The Problem I Was Trying to Solve
I work on a large enterprise platform, a distributed system with about 12 microservices, two frontend applications, smart contracts, and a Kubernetes infrastructure spanning multiple environments. It's the kind of codebase where a single feature might touch the database schema, three backend services, blockchain logic, and both frontends.
When I first started using Claude Code, I was genuinely excited. For small tasks, it was incredible. Fix this function. Write this component. Debug this error. Brilliant.
But when I tried to use it for larger work (building entire modules, coordinating changes across services), things started breaking down:
The context problem: About halfway through a complex task, I'd notice the AI forgetting decisions we'd made earlier in the conversation. It would suggest approaches we'd already ruled out, or introduce inconsistencies with code it had written an hour ago.
The specialization problem: I'd ask for a database schema design and get something that worked but didn't follow our established patterns. Not because Claude couldn't, but because one prompt can't contain everything about how we do things.
The quality problem: Code would get written, and then I'd spend time finding bugs that a fresh set of eyes would have caught immediately. The same "mind" that wrote the code was reviewing it.
I don't think any of these are failures of the AI itself. They're failures of how I was using it. A single human engineer working alone faces similar problems at scale; that's why we have teams.
So I thought: what if I tried to create a team?
My First Attempts (And Why They Failed)
I want to be honest about this: my first several attempts didn't work well.
Attempt 1: Just spawn more agents
My naive first approach was to just spawn sub-agents for different parts of a task. "You handle the backend, you handle the frontend." The problem? They had no coordination. Agent A would make assumptions that Agent B contradicted. There was no shared understanding of what we were building.
Attempt 2: Detailed upfront specifications
Next, I tried writing extremely detailed specifications before spawning any agents. Every file to create, every function signature, every edge case. This worked better, but it took so long to write the specs that I might as well have just written the code myself. And the specs would inevitably miss something, causing cascading problems.
Attempt 3: Sequential pipeline
Then I tried a strict sequence: design agent → implementation agent → review agent → fix agent. Linear, predictable. The problem was that when the review agent found issues, the fix agent had no context about why the original decisions were made. It would "fix" things by undoing intentional choices.
Each failure taught me something. The key insight that eventually emerged: agents need hierarchy, not just sequence. They need to know who to ask when they're uncertain. They need someone coordinating their work who can hold the bigger picture.
The Architecture I Eventually Landed On
After months of iteration, here's what I'm currently using. I want to emphasize "currently": I change things every few weeks as I learn more.
How This Maps to a Real Project
Here's how these agents relate to a typical enterprise architecture:
The Breakthrough: The Prompt-Writer Agent
If there's one thing I've learned that I wish I'd understood earlier, it's this: the quality of the prompt determines everything.
For months, I was spawning agents with prompts like:
"Fix the login bug in the auth service"
And wondering why the results were inconsistent. The agent would spend half its context just figuring out what files to look at, what our conventions were, what had been tried before.
Then I tried something different. I created a lightweight agent whose only job is to write prompts for other agents.
How It Works
Instead of:
Task(prompt="Fix the login bug in the auth service", agent="debugger")
I now do:
Step 1: Task(
    prompt="Generate a debugging prompt for the login issue in the auth service.
            Include relevant file paths, our port conventions, recent
            changes, and verification steps.",
    agent="prompt-writer"
)
Step 2: Task(
    prompt=[the detailed prompt from step 1],
    agent="debugger"
)
What the Prompt-Writer Produces
The prompt-writer reads the relevant files, understands our conventions, and produces something like:
## Debugging Task: AUTH-001
### Context
Users report 401 errors on login despite valid credentials.
This started after commit abc123 (JWT refactor, 2026-01-28).
### Your Task
Diagnose why JwtAuthGuard is rejecting valid tokens.
### Files to Examine
| File | Why | Key Lines |
|------|-----|-----------|
| src/guards/jwt-auth.guard.ts | Token validation | 45-78 |
| src/services/token.service.ts | Token generation | 112-145 |
### Project Context
- Auth service runs on port 3041
- JWT secret is in K8s secret, not hardcoded
- We use Clean Architecture: guard is presentation layer
### What Success Looks Like
- Root cause identified with file:line reference
- Explanation of why this causes the symptom
- Specific fix recommendation
### How to Verify Your Fix
curl -X POST http://localhost:3041/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com", "password": "testpass"}'
Should return 200 with JWT token
The difference in output quality is dramatic. I don't have hard metrics, but subjectively I'd say tasks that used to take 3-4 attempts now usually succeed on the first try.
What I still don't know: Is there an optimal prompt structure? I've been iterating on templates, but I'm not sure I've found the best format. If you try this and find improvements, I'd genuinely love to hear about them.
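One way to keep generated prompts consistent is to render them from a fixed template. The sketch below is only an illustration of that idea, not the format the prompt-writer agent actually uses: `PromptSpec` and `render_prompt` are hypothetical names, and the section headings are copied from the example prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the sections shown in the example prompt."""
    task_id: str
    context: str
    task: str
    files: list[tuple[str, str, str]] = field(default_factory=list)  # (path, why, key lines)
    project_context: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)
    verification: str = ""


def render_prompt(spec: PromptSpec) -> str:
    """Render the spec as markdown, skipping any section that is empty."""
    out = [f"## Debugging Task: {spec.task_id}",
           "### Context", spec.context,
           "### Your Task", spec.task]
    if spec.files:
        out += ["### Files to Examine",
                "| File | Why | Key Lines |",
                "|------|-----|-----------|"]
        out += [f"| {path} | {why} | {lines} |" for path, why, lines in spec.files]
    if spec.project_context:
        out += ["### Project Context"] + [f"- {item}" for item in spec.project_context]
    if spec.success_criteria:
        out += ["### What Success Looks Like"] + [f"- {item}" for item in spec.success_criteria]
    if spec.verification:
        out += ["### How to Verify Your Fix", spec.verification]
    return "\n".join(out)


spec = PromptSpec(
    task_id="AUTH-001",
    context="Users report 401 errors on login despite valid credentials.",
    task="Diagnose why JwtAuthGuard is rejecting valid tokens.",
    files=[("src/guards/jwt-auth.guard.ts", "Token validation", "45-78")],
    project_context=["Auth service runs on port 3041"],
)
print(render_prompt(spec))
```

A fixed structure like this also makes it easier to compare template variants, since only one section changes at a time.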
The Supervisor Pattern: When Things Get Complex
For really complex tasks, like migrating a deprecated model across multiple services, a single orchestrator managing many agents gets overwhelming. There's too much to track.
I introduced "supervisor" agents for these cases. They own a track of work and manage their own sub-agents.
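As a rough picture of the supervisor idea, the hierarchy can be modeled as a small tree in which each supervisor owns its own sub-agents. This is a sketch under assumptions: the supervisor and agent names below are hypothetical, and the three-level cap is simply the practical limit mentioned in this section, not a hard rule.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    children: list["Agent"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the tree rooted at this agent."""
        return 1 + max((child.depth() for child in self.children), default=0)

    def outline(self, indent: int = 0) -> list[str]:
        """Indented listing of this agent and everything it manages."""
        lines = ["  " * indent + self.name]
        for child in self.children:
            lines += child.outline(indent + 1)
        return lines


MAX_LEVELS = 3  # assumption: orchestrator -> supervisor -> worker

# Hypothetical tracks for a cross-service migration.
orchestrator = Agent("orchestrator", [
    Agent("backend-supervisor", [Agent("backend-impl"), Agent("tester")]),
    Agent("frontend-supervisor", [Agent("frontend-impl"), Agent("reviewer")]),
])

assert orchestrator.depth() <= MAX_LEVELS
print("\n".join(orchestrator.outline()))
```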
The Error Recovery Flow
One thing I'm proud of is the error recovery pattern. When something fails:
The key insight: never retry blindly. When something fails, understand why before trying again. The debugger agent exists specifically for this: diagnosing problems, not fixing them.
What I'm still figuring out: How deep should the hierarchy go? I've tried four levels and it gets confusing. Three seems to be my practical limit, but I'm not sure if that's a fundamental constraint or just my current skill level.
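The "never retry blindly" rule can be sketched as a loop in which every retry is preceded by a diagnosis. This is a minimal illustration, not the actual recovery flow: `implement` and `diagnose` are hypothetical callables standing in for spawning the implementation and debugger agents, and the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable

# Hypothetical signatures: each callable stands in for spawning an agent.
Implement = Callable[[str], None]           # raises an exception on failure
Diagnose = Callable[[str, Exception], str]  # returns a root-cause report


def run_with_recovery(task: str, implement: Implement, diagnose: Diagnose,
                      max_attempts: int = 3) -> list[str]:
    """Run a task; after each failure, obtain a diagnosis before the next attempt.

    Returns the diagnoses collected along the way (empty if the first try works).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    diagnoses: list[str] = []
    prompt = task
    for _ in range(max_attempts):
        try:
            implement(prompt)
            return diagnoses
        except Exception as err:
            report = diagnose(prompt, err)
            diagnoses.append(report)
            # The next prompt carries the diagnosis, so it is never a blind retry.
            prompt = f"{task}\n\nPrevious failure diagnosis:\n{report}"
    raise RuntimeError(f"gave up after {max_attempts} attempts: {diagnoses[-1]}")


def stub_impl(prompt: str) -> None:
    """Stand-in implementer that only succeeds once it has a diagnosis."""
    if "diagnosis" not in prompt:
        raise ValueError("Token validation error")


def stub_debugger(prompt: str, err: Exception) -> str:
    """Stand-in debugger that reports on the error without fixing anything."""
    return f"root cause for: {err}"


result = run_with_recovery("Fix the login bug", stub_impl, stub_debugger)
print(result)
```

Keeping diagnosis and fixing in separate callables mirrors the split between the debugger and implementation agents.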
Skills: Codifying What I've Learned
Over time, I noticed I was repeating certain patterns. "When committing, always stage files individually." "When reviewing, check security first." So I started writing these down as "skills": reusable workflows that any agent can reference.
My Current Skills
Development skills:
/frontend - Next.js patterns, component library conventions
/backend - Clean Architecture layers, ORM patterns, framework conventions
/contracts - Blockchain determinism rules, access control patterns
/infra - Kubernetes manifests, namespace conventions, resource limits
Process skills:
/commit - How we do git commits (staged, per-file, conventional format)
/review - Multi-pass code review (logic → security → performance)
/debug - Systematic debugging protocol
Orchestration skills:
/multi-agent-orchestration - The full hierarchical pattern
/systematic-debugging - Step-by-step diagnosis
/verification-before-completion - Checklist before declaring done
Example: The Commit Skill
# /commit skill
## Protocol
1. Stage files individually (never `git add .` for multi-file changes)
2. Write descriptive message per file
3. Follow conventional commit format
4. No AI attribution in commits
## Format
type(scope): imperative description
Types: feat, fix, refactor, chore, docs, test
Scopes: auth, docs, wallet, admin, contracts, infra
## Example
git add src/guards/jwt-auth.guard.ts
git commit -m "fix(auth): validate token expiry before checking claims"
git add src/services/token.service.ts
git commit -m "refactor(auth): extract token validation to dedicated method"
I invoke this with /commit and the agent knows exactly what to do.
What I don't know: Are skills the right abstraction? Sometimes I wonder if they should be more granular, or if some should be combined. I'm experimenting.
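The commit protocol above is mechanical enough to check in code. The sketch below is an assumption about how such a check could look, not part of the skill itself: `is_valid_message` and `commit_commands` are hypothetical helpers, the types and scopes are copied from the skill, and only the `type(scope): description` shape is validated (not whether the description is imperative).

```python
import re
import shlex

TYPES = ("feat", "fix", "refactor", "chore", "docs", "test")
SCOPES = ("auth", "docs", "wallet", "admin", "contracts", "infra")
PATTERN = re.compile(r"^(%s)\((%s)\): \S.*$" % ("|".join(TYPES), "|".join(SCOPES)))


def is_valid_message(message: str) -> bool:
    """True when the message matches type(scope): description."""
    return PATTERN.match(message) is not None


def commit_commands(changes: dict[str, str]) -> list[str]:
    """Build one `git add` + `git commit` pair per file, in insertion order."""
    commands = []
    for path, message in changes.items():
        if not is_valid_message(message):
            raise ValueError(f"not a conventional commit message: {message!r}")
        commands.append(f"git add {shlex.quote(path)}")
        commands.append(f"git commit -m {shlex.quote(message)}")
    return commands


commands = commit_commands({
    "src/guards/jwt-auth.guard.ts": "fix(auth): validate token expiry before checking claims",
    "src/services/token.service.ts": "refactor(auth): extract token validation to dedicated method",
})
print("\n".join(commands))
```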
Rules: Automatic Guardrails
Unlike skills, rules are always active. They're constraints that every agent must follow, without me having to invoke anything.
My Current Rules
| Rule | What It Enforces |
|---|---|
| clean-architecture.md | Domain can't depend on infrastructure |
| testing-pyramid.md | 70% unit, 20% integration, 10% E2E |
| git-workflow.md | Branch naming, commit format |
| security-standards.md | Auth guards, input validation, no hardcoded secrets |
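A ratio rule like the testing pyramid is easy to check arithmetically. The following is a hedged sketch only: `pyramid_report` is a hypothetical helper, and the 5-point tolerance is an assumption chosen for illustration, not something the rule file specifies.

```python
# Target shares taken from the testing-pyramid rule in the table above.
TARGET = {"unit": 0.70, "integration": 0.20, "e2e": 0.10}


def pyramid_report(counts: dict[str, int], tolerance: float = 0.05) -> dict[str, bool]:
    """For each test kind, report whether its share is within tolerance of target."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tests counted")
    return {kind: abs(counts.get(kind, 0) / total - share) <= tolerance
            for kind, share in TARGET.items()}


# 140 of 200 tests are unit tests (70%), 40 integration (20%), 20 E2E (10%).
print(pyramid_report({"unit": 140, "integration": 40, "e2e": 20}))
```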
Example: Clean Architecture Rule
# Clean Architecture Rule
## The Dependency Rule
Dependencies MUST point inward only.
✅ Allowed:
- Controller → Use Case → Entity
- Repository Implementation → Repository Interface
❌ Forbidden:
- Entity → ORM Client
- Use Case → Controller
- Domain → anything external
## If You're Unsure
Ask yourself: "Could this inner layer work without the outer layer existing?"
If no, the dependency is pointing the wrong direction.
When the reviewer agent checks code, it knows to look for these violations.
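The dependency rule can also be expressed as a small check over (importer, imported) layer pairs. This is a simplified sketch, not the reviewer's actual logic: it assumes the four layers can be ranked linearly from innermost to outermost, which is a simplification, and `is_allowed` and `violations` are hypothetical names.

```python
# Assumption: layers ranked from innermost (0) to outermost (3).
LAYER_RANK = {"domain": 0, "application": 1, "infrastructure": 2, "presentation": 3}


def is_allowed(importer: str, imported: str) -> bool:
    """A dependency is allowed when it points to the same layer or inward."""
    return LAYER_RANK[imported] <= LAYER_RANK[importer]


def violations(deps: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (importer, imported) pairs that point outward."""
    return [(src, dst) for src, dst in deps if not is_allowed(src, dst)]


deps = [
    ("presentation", "application"),   # Controller -> Use Case
    ("application", "domain"),         # Use Case -> Entity
    ("infrastructure", "domain"),      # Repository Implementation -> Repository Interface
    ("domain", "infrastructure"),      # Entity -> ORM Client (forbidden)
    ("application", "presentation"),   # Use Case -> Controller (forbidden)
]
print(violations(deps))
```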
Work Records: Remembering Across Sessions
One of my biggest frustrations early on: every conversation started fresh. We'd make decisions, then in the next session, have to re-explain everything.
My solution is aggressive documentation. A work-recorder agent runs continuously, documenting:
Example Work Record
## Session 14: Building the Onboarding Module
**Date:** 2026-02-01
**Focus:** User onboarding flow
---
### The Context
Session 13 completed the document upload feature. But upload is just one step;
we need the full onboarding journey: welcome → identity → documents →
verification → activation.
---
### Key Decisions Made
**Decision: Server-side session state**
We considered client-side state (localStorage) vs server-side (Redis).
Chose server-side because:
- User might switch devices mid-onboarding
- We need to track abandonment for analytics
- Sensitive data shouldn't live in browser
**Decision: Step-by-step validation**
Each step validates before allowing progression. This is not just frontend
validation: the backend confirms each step's completion before unlocking the next.
---
### What We Built
| File | Purpose |
|------|---------|
| src/domain/entities/onboarding-session.entity.ts | Session state machine |
| src/application/use-cases/advance-step/ | Step progression logic |
| app/(onboarding)/layout.tsx | Shared onboarding layout |
---
### What Didn't Work
**First attempt at step validation:**
We tried validating in the frontend route guards. Problem: too easy to
bypass. Moved validation to backend with signed session tokens.
---
### Still Pending
- [ ] Email verification integration (waiting on SMTP config)
- [ ] Analytics events for funnel tracking
- [ ] Error recovery flow if user abandons mid-process
---
### Lesson Learned
Onboarding is a state machine. Should have modeled it that way from the
start instead of treating it as a sequence of pages.
Every session starts by reading the previous work record. It's like having notes from yesterday's meeting.
What I haven't solved: The work records can get long. I'm thinking about summarization: keeping detailed records but generating compressed versions for context. I haven't implemented it yet.
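One possible shape for a compressed record is a plain filter that keeps the session title and a few high-value sections. This is a speculative sketch of the summarization idea, not an existing part of the system: `compress_record` is a hypothetical helper, and the choice of which sections to keep is an assumption based on the example record above.

```python
# Assumption: these sections matter most when reloading context.
KEEP = ("Key Decisions Made", "Still Pending", "Lesson Learned")


def compress_record(markdown: str, keep: tuple[str, ...] = KEEP) -> str:
    """Keep the `##` session title plus only the `###` sections named in `keep`."""
    out: list[str] = []
    keeping = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            out.append(line)          # session title
            keeping = False
        elif line.startswith("### "):
            keeping = line[4:].strip() in keep
            if keeping:
                out.append(line)
        elif keeping and line.strip() and line.strip() != "---":
            out.append(line)          # body of a kept section, minus separators
    return "\n".join(out)


record = "\n".join([
    "## Session 14: Building the Onboarding Module",
    "**Date:** 2026-02-01",
    "---",
    "### The Context",
    "Session 13 completed the document upload feature.",
    "---",
    "### Still Pending",
    "- [ ] Email verification integration (waiting on SMTP config)",
    "---",
    "### What We Built",
    "| File | Purpose |",
])
print(compress_record(record))
```

The full record stays on disk; only the compressed form would be loaded into a new session.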
The Agent Definitions in Detail
Let me share exactly how I define each agent. These live in configuration files that Claude Code can read.
Architect Agent
# System Architect Agent
## Metadata
- **Model**: opus
- **Tools**: Read, Grep, Glob (NO Edit, NO Write)
- **Role**: Design decisions only
## Responsibilities
1. Design module structures following Clean Architecture
2. Define API contracts with request/response schemas
3. Plan database schema changes
4. Ensure consistency across services
## Output Format
When planning, provide:
1. Module overview (purpose, services affected)
2. File list with descriptions
3. Interface definitions
4. API endpoint specifications
5. Implementation order
6. Risk assessment
## What You DON'T Do
- Write implementation code
- Make changes to files
- Run commands
Your job is to think and design. Implementation is someone else's job.
Backend Implementation Agent
# Backend Implementation Agent
## Metadata
- **Model**: sonnet
- **Tools**: Read, Edit, Write, Bash
## Context
- NestJS monorepo structure
- Clean Architecture: domain → application → infrastructure → presentation
- Prisma ORM for database
- Services run on ports 3040-3047
## Your Responsibilities
1. Implement backend features following the design
2. Follow existing patterns in the codebase
3. Write code that passes TypeScript strict mode
## Before Writing Code
- Read similar existing code to match patterns
- Check the schema for correct field names
- Verify port numbers and service locations
## After Writing Code
Run these verifications:
npx prisma generate
npx tsc --noEmit
npm run lint
## What You DON'T Do
- Make architectural decisions (ask the orchestrator)
- Skip verification steps
- Assume patterns without checking
Debugger Agent
# Debugging Specialist Agent
## Metadata
- **Model**: sonnet
- **Tools**: Read, Grep, Glob, Bash (limited)
## Your Process
1. **Gather**: What's the exact error? When did it start?
2. **Reproduce**: Can you trigger it reliably?
3. **Trace**: Follow the error path through the code
4. **Identify**: What's the root cause (not symptom)?
5. **Document**: Report your findings clearly
## Output Format
Return a diagnosis report:
- Summary (one sentence)
- Error details (message, location, environment)
- Root cause analysis (why this happens)
- Affected files (with line numbers)
- Recommended fix (what to change)
## What You DON'T Do
- Apply fixes yourself (that's impl's job)
- Guess without evidence
- Stop at symptoms
Your job is diagnosis. Be thorough. Be certain.
Model Routing: Why Different Models for Different Agents
I use three Claude models, and which one depends on the task:
| Model | Cost | When I Use It |
|---|---|---|
| Opus | Highest | Architecture, security, complex decisions |
| Sonnet | Medium | Implementation, validation, most tasks |
| Haiku | Lowest | Fast exploration, prompt writing, documentation |
The principle I follow: use the cheapest model that can do the job reliably.
What I'm uncertain about: Is this the right split? Sometimes I wonder if Opus would catch bugs that Sonnet misses. But I don't have good data on this, just intuition.
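The agent definitions and the routing table can be summarized as data. The sketch below is illustrative only: `AgentSpec`, `AGENTS`, and `cheapest_capable` are hypothetical names, the models and tool lists are copied from the definitions above, and the cost ranking reflects only the relative order in the table, not actual prices.

```python
from dataclasses import dataclass

# Relative cost order from the table: Haiku lowest, Opus highest.
COST_RANK = {"haiku": 0, "sonnet": 1, "opus": 2}


@dataclass(frozen=True)
class AgentSpec:
    name: str
    model: str
    tools: frozenset[str]

    def can_use(self, tool: str) -> bool:
        """True when the tool appears in this agent's allow-list."""
        return tool in self.tools


AGENTS = {
    "architect": AgentSpec("architect", "opus", frozenset({"Read", "Grep", "Glob"})),
    "backend-impl": AgentSpec("backend-impl", "sonnet",
                              frozenset({"Read", "Edit", "Write", "Bash"})),
    # The debugger's Bash access is described as limited; that nuance is not modeled here.
    "debugger": AgentSpec("debugger", "sonnet", frozenset({"Read", "Grep", "Glob", "Bash"})),
}


def cheapest_capable(candidates: list[str]) -> str:
    """Pick the lowest-cost model among those judged reliable for a task."""
    return min(candidates, key=COST_RANK.__getitem__)


print(AGENTS["architect"].can_use("Edit"))
print(cheapest_capable(["opus", "sonnet"]))
```

Deciding which models count as "capable" for a given task is still a judgment call; the function only applies the cheapest-first rule once that list exists.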
What I'm Still Figuring Out
I want to be honest about the limitations and open questions:
1. Agent Memory
Right now, each agent spawn starts fresh. The prompt-writer helps by injecting context, but there's no true memory. I'm exploring whether agents should maintain state between invocations.
2. Cost Management
Multi-agent orchestration uses more tokens than single-agent work. For simple tasks, it's overkill. I'm still developing intuition for when to use the full hierarchy vs. just asking Claude directly.
3. Failure Modes
When an agent produces wrong output confidently, the system doesn't always catch it. The reviewer helps, but isn't perfect. I'd love better automated validation.
4. Observability
It's hard to understand what happened across many agents. I have work records, but no proper traces. Building better observability is on my list.
5. Generalization
This system is tuned for my project. Would it work for others? I think the patterns are general, but the specific agents and skills need customization.
Practical Advice If You Want to Try This
Based on my experience, here's how I'd suggest starting:
Week 1: Three Agents
Start with just three:
Get comfortable with the pattern of separating design from implementation from validation.
Week 2: Add prompt-writer
This single addition will improve everything else. Having an agent that writes good prompts for other agents is multiplicative.
Week 3: Add work-recorder
Start documenting sessions systematically. You'll thank yourself when you return after a break and can read what happened.
Week 4+: Customize
Add agents specific to your stack. For me that's contract-impl and infra-impl. For you it might be mobile-impl or ml-impl.
Get the Template
I've open-sourced the complete architecture as a ready-to-use template. It includes all the agents, skills, rules, and work recording setup described in this post:
Claude Multi-Agent Architecture Template
A production-ready template for building sophisticated multi-agent AI systems with Claude Code.
# Clone the template
git clone https://github.com/mnzralee/claude-multi-agent-architecture.git
# Copy to your project
cp -r claude-multi-agent-architecture/.claude your-project/
A Day in My Workflow
Let me describe what this actually looks like in practice.
Morning: Starting a session
1. Read previous session's work record
2. Check current progress on the active module
3. See that email verification task is in_progress
4. Recall: blocked on SMTP config yesterday
Working on a feature
Me: "Let's implement the email verification flow. The SMTP is now configured."
Claude (orchestrator):
- Spawns architect to review the design
- Architect confirms the approach, identifies files to create
- Spawns prompt-writer to generate impl prompt
- Spawns backend-impl with the detailed prompt
- Impl creates the files
- Spawns tester to write tests
- Spawns reviewer to check the code
- Reports back with summary
Me: "/commit"
- Commit skill stages each file separately
- Writes conventional commit messages
Hitting a problem
Test fails: "Token validation error"
Claude:
- Spawns debugger with error context
- Debugger traces through the code
- Identifies: wrong environment variable name
- Spawns prompt-writer for fix prompt
- Spawns backend-impl to apply fix
- Spawns tester to verify fix
- Reports resolution
Work-recorder logs the whole journey.
Ending the session
Me: "Let's wrap up"
Claude:
- work-recorder compiles session summary
- Updates progress tracking
- Lists what was completed, what's pending
- Suggests next session focus
Conclusion: What This Taught Me
Building this system taught me more about software engineering than about AI. The principles that make multi-agent orchestration work are the same principles that make human teams work:
AI doesn't eliminate these needs; it makes them more visible. When you're working with agents, you can't rely on implicit understanding. Everything must be explicit. And that explicitness, it turns out, makes the whole system better.
I don't think I've figured out the optimal approach. This is year one of a multi-decade shift in how software gets built. I'm learning in public, making mistakes, iterating.
If you try any of this, I'd genuinely love to hear what works for you and what doesn't. The best ideas for improving this system have come from others experimenting with similar approaches.
---
This isn't the future of how AI replaces engineers. It's the future of how engineers work with AI. The orchestrator's job (seeing the big picture, making judgment calls, deciding what matters) is still fundamentally human. We're just getting better tools.
---
This is a living system; I update it regularly as I learn. Last updated: 2026-02-02.
