agentic-coding · 2026-05-17
What Does Not Work in Agentic Coding — Lessons from 8 Conductor Versions
TL;DR
Eight major versions of Conductor in 24 months, each with an architecture break. This post lists the seven concrete failures behind those breaks, each with symptom, cause, and fix. It is not a best-practice list but an anti-best-practice list: what breaks agentic coding if you do not know about it. The “Agentic Coding Best Practices” discourse is saturated (Anthropic engineering blog, Maven, GoDaddy, Google Cloud, McKinsey). What is missing: honest failure reports with token-budget deltas.
Featured image placeholder: stylized version timeline with break points. Asset pending — prompt at end.
Table of Contents
- Failure 1: The mega-prompt was a lie
- Failure 2: Cleverness instead of granularity (the v5 rollback)
- Failure 3: Synchronous MCP calls in the agent loop
- Failure 4: Skills without precise description lines
- Failure 5: Subagents as conversation containers
- Failure 6: No hooks before quality gates
- Failure 7: Cross-repo assumptions without verification
- What the best-practice posts leave out
- Where this goes next
- FAQ
Failure 1: The mega-prompt was a lie
Version: Conductor v1.0 (May 2024)
Symptom: Responses became sloppy, irrelevant, and hallucinated after 20 minutes in a session.
Cause: I had an 8,000-word system prompt that described “everything”: PRD, API, DB, tests, release notes. My assumption: more context = better answers.
What actually happened: the mega-prompt consumed the context window from the start. By the first response, 60% of the available token budget had already gone to re-reading the system prompt. Each subsequent interaction got worse because more of the budget went to re-processing the prompt than to the actual task.
Fix in v2.0: split into 4 smaller plugins. Each plugin contains only what is relevant for its specific phase. Token budget saved: ~70%.
Lesson: “More context = better” is wrong above a certain size. One plugin per domain, not one mega-plugin per solo founder. Anthropic’s engineering blog on context engineering has since confirmed this, but when I built v1, that post was not yet public.
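A minimal sketch of phase-scoped context loading, to make the split concrete. The phase names, file names, and the four-characters-per-token heuristic are all assumptions for illustration; they are not Conductor's actual plugin layout.

```python
from pathlib import Path

# Hypothetical phase-to-document mapping; phase and file names are illustrative only.
PHASE_DOCS = {
    "planning": ["prd.md"],
    "implementation": ["api.md", "db.md"],
    "testing": ["tests.md"],
    "release": ["release_notes.md"],
}


def build_context(phase: str, docs_dir: Path) -> str:
    """Concatenate only the documents registered for one phase."""
    if phase not in PHASE_DOCS:
        raise KeyError(f"unknown phase: {phase!r}")
    parts = []
    for name in PHASE_DOCS[phase]:
        path = docs_dir / name
        if path.exists():  # silently skip documents that are not present
            parts.append(path.read_text(encoding="utf-8"))
    return "\n\n".join(parts)


def rough_token_estimate(text: str) -> int:
    """Very rough heuristic (assumption): about 4 characters per token."""
    return len(text) // 4
```

Comparing `rough_token_estimate` for one phase against the estimate for all documents combined gives a quick, if crude, way to check how much a split actually saves.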
Failure 2: Cleverness instead of granularity (the v5 rollback)
Version: Conductor v5.0 (April 2025), rolled back after 2 weeks
Symptom: Output became sluggish, response quality was worse than v4, and switching between phases was slower.
Cause: I had tried to build a “smart-router” plugin architecture. One master plugin was supposed to decide intelligently which sub-plugin to load, using a decision tree with ~20 branches.
What actually happened: the smart router was itself too complex. For every request Claude had to walk through the decision tree before the actual work could start. The “intelligent” pre-stage ate ~300-500 tokens per call, without the routing decision being better than a trivial heuristic.
Fix in v6.0: smart router removed entirely. Plain slot strategy: you say which plugin family is relevant, Conductor loads it directly. Latency saved: ~25%. Output quality: +1 tier.
Lesson: in the agent loop, dumb-explicit almost always beats clever-implicit. When you build an “intelligent” decision layer, ask first: could a human make that call in 5 seconds? If yes, the intelligent layer is overhead.
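The plain slot strategy can be sketched as a dictionary lookup: the caller names the plugin family and nothing is inferred. The family names and paths below are hypothetical placeholders, not Conductor's real registry.

```python
# Hypothetical plugin families and load paths; names are illustrative only.
PLUGIN_FAMILIES = {
    "testing": "plugins/testing",
    "docs": "plugins/docs",
    "release": "plugins/release",
}


def resolve_plugin(family: str) -> str:
    """Explicit slot lookup: the caller names the family, no routing logic runs."""
    try:
        return PLUGIN_FAMILIES[family]
    except KeyError:
        known = ", ".join(sorted(PLUGIN_FAMILIES))
        raise ValueError(f"unknown plugin family {family!r}; known: {known}") from None
```

An unknown family fails loudly instead of being guessed. That is the point of the design: the error costs one retry, while a routing layer costs tokens on every call.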
Inline image placeholder: visualization of the first two failures as a learning arrow. Asset pending.
Failure 3: Synchronous MCP calls in the agent loop
Version: Conductor v6.0-v6.3 (August to October 2025)
Symptom: Agent loop blocked for 5-10 seconds on every tool call. Workflows with 15+ calls became unusable.
Cause: I had built a Yahoo Finance MCP server. Every call was synchronous, taking ~400ms median, sometimes 2-3s on quota limits or network hiccups.
What actually happened: in a typical AI Trader workflow with 30 symbol lookups, the MCP latencies summed to 12-90 seconds — on top of Claude’s normal response time. Latency was so noticeable that I stopped using the Yahoo MCP server altogether.
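The 12-90 second range is plain multiplication, assuming the 30 lookups run strictly one after another. A short sketch of the arithmetic, using the ~400ms median and the 3s slow end quoted above:

```python
def total_latency_s(calls: int, per_call_s: float) -> float:
    """Total wait for sequential calls: each call blocks until the previous one returns."""
    return calls * per_call_s


LOOKUPS = 30  # symbol lookups in the workflow described above

print(total_latency_s(LOOKUPS, 0.4))  # every call at the ~400 ms median
print(total_latency_s(LOOKUPS, 3.0))  # every call at the 3 s slow end
```

Because the calls are synchronous, nothing overlaps, so the total grows linearly with the number of lookups.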
Fix in v7.0: MCP server removed. Replaced by a Python script that caches Yahoo data locally every 5 minutes. Claude accesses the cached data via the standard Bash tool. Latency reduction: ~95%.
Lesson: MCP is the most expensive extension class (see my plugins-vs-skills post). 200-500ms per call best case, often more in reality. Only use it when the task truly needs a live service, and never for data you can cache locally.
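A minimal sketch of the local-cache pattern, under stated assumptions: `fetch_quote` is a hypothetical function supplied by the caller (no real Yahoo Finance API is invoked here), and the JSON layout is illustrative. A scheduler such as cron would call `refresh_cache` every 5 minutes; the agent only ever reads the file.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable

CACHE_TTL_S = 300  # 5 minutes, matching the refresh interval described above


def refresh_cache(
    symbols: Iterable[str],
    fetch_quote: Callable[[str], float],  # hypothetical fetcher, injected by the caller
    cache_path: Path,
) -> None:
    """Fetch all symbols once and write them to a local JSON file."""
    payload = {
        "fetched_at": time.time(),
        "quotes": {symbol: fetch_quote(symbol) for symbol in symbols},
    }
    tmp = cache_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(cache_path)  # swap in one step so readers never see a partial file


def read_cache(cache_path: Path, max_age_s: float = CACHE_TTL_S) -> Dict[str, float]:
    """Return cached quotes; raise RuntimeError if the cache is older than max_age_s."""
    data = json.loads(cache_path.read_text(encoding="utf-8"))
    age = time.time() - data["fetched_at"]
    if age > max_age_s:
        raise RuntimeError(f"cache is {age:.0f}s old, refresh first")
    return data["quotes"]
```

Reading a local file takes a fraction of a network round trip, and the slow fetch happens outside the agent loop where nobody is waiting on it.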
Failure 4: Skills without precise description lines
Version: Conductor v3.5 (November 2024)
Symptom: Skills never loaded automatically, or loaded too often, at the wrong point.
Cause: The description lines of my Skills were vague: “This skill handles testing”, “This is for documentation”. Claude could not distinguish between similar contexts.
What actually happened: for “How many tests does this module have?” (question) and “Write tests for this module” (task), the same Skill loaded. But the Skills had to provide different knowledge.
Fix in v4.0: descriptions written like test names. Before: “This skill handles testing”. After: “Use when user wants to generate new test files for an existing module that lacks test coverage. Do NOT use for reading or analyzing existing tests.” Skill hit rate: from ~30% to ~85%.
Lesson: Skill descriptions are API contracts with Claude. Vague descriptions = breaking the contract. Anthropic’s official skills documentation shows the format — but the discriminating sharpness has to come from you.
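One way to keep descriptions sharp is to lint them before shipping. The sketch below is a hypothetical heuristic, not part of any official tooling: it assumes a good description states a positive trigger ("Use when"), a negative boundary ("Do NOT use"), and enough words to discriminate. The 12-word threshold is an arbitrary assumption.

```python
def lint_description(description: str) -> list[str]:
    """Return a list of problems; an empty list means the description passes."""
    problems = []
    text = description.lower()
    if "use when" not in text:
        problems.append("missing a positive trigger ('Use when ...')")
    if "do not use" not in text:
        problems.append("missing a negative boundary ('Do NOT use ...')")
    if len(description.split()) < 12:  # arbitrary threshold, tune to taste
        problems.append("too short to discriminate between similar requests")
    return problems


vague = "This skill handles testing"
precise = (
    "Use when user wants to generate new test files for an existing module "
    "that lacks test coverage. Do NOT use for reading or analyzing existing tests."
)
print(lint_description(vague))
print(lint_description(precise))
```

A check like this catches the vague "before" wording from the fix above and lets the "after" wording through, but it only tests form. Whether two descriptions really separate two similar requests still needs a human read.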
Failure 5: Subagents as conversation containers
Version: Conductor v6.5 (September 2025)
Symptom: Subagent workflows broke mid-task or produced half-finished output.
Cause: I had tried to use subagents iteratively. “Hey subagent, do step 1. Okay good, now step 2 with the result of step 1.”
What actually happened: subagents have no persistence between invocations. The second call knew nothing about the first — it started with fresh context. The iterative pattern I had attempted was never what subagents were designed for.
Fix in v7.0: subagents only for one-shot heavy work. Iterative workflows stay in the main conversation or run as a sequence of separate tasks with an explicit hand-off (a markdown file as state). Workflow drop-out rate: from ~40% to ~5%.
Lesson: subagents are fresh-context one-shots, not conversation partners. If you want to iterate, you need either main Claude or explicit state passing between separate subagent calls.
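Explicit state passing through a markdown file can be sketched as follows. The file layout and function names are assumptions for illustration; the idea is only that each subagent call receives everything it needs in its prompt, because it remembers nothing from the previous call.

```python
from pathlib import Path


def write_handoff(path: Path, step: str, result: str) -> None:
    """Append one step's result to the hand-off file as a markdown section."""
    with path.open("a", encoding="utf-8") as f:
        f.write(f"## {step}\n\n{result.strip()}\n\n")


def build_prompt(path: Path, task: str) -> str:
    """Build a self-contained prompt: prior results first, then the new task."""
    state = path.read_text(encoding="utf-8") if path.exists() else ""
    return f"Prior results:\n\n{state}\nTask: {task}\n"
```

After step 1 finishes, its output goes through `write_handoff`; the prompt for step 2 comes from `build_prompt`, so the second subagent starts with the first one's result in front of it instead of an empty context.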
Failure 6: No hooks before quality gates
Version: Conductor v1-v6 (May 2024 to November 2025)
Symptom: Production bugs from commits that had missed tests or were pushed without build checks. ~5-8 bug incidents in 18 months.
Cause: I relied on “remembering manually” to run tests before every push. ADHD-typical forgetting tax in practice.
What actually happened: manual discipline does not scale in 8-parallel-projects mode. I forgot a quality check on average once per week. Of those, ~30% landed in production bugs.
Fix in v7.0: hooks system introduced. Every plugin can define PreToolUse hooks that must run before a git push. The finalizer plugin enforces: build green, tests green, lint clean, or no push. Bug incidents since v7.0: 1 (over 6 months).
Lesson: Hooks > conventions. What you cannot enforce, you forget. Especially in hyperfocus mode.
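A hedged sketch of what such a gate script could look like. It assumes the hook receives a JSON description of the tool call on stdin with the shell command under `tool_input.command`, and that exit code 2 blocks the call; check the hooks documentation for the exact contract. The `make` targets are hypothetical stand-ins for a project's real build, test, and lint commands.

```python
import json
import subprocess
import sys
from typing import Callable, List, Sequence

# Hypothetical gate commands; substitute the project's real build/test/lint calls.
GATES: List[Sequence[str]] = [
    ["make", "build"],
    ["make", "test"],
    ["make", "lint"],
]


def is_git_push(payload: dict) -> bool:
    """True if the tool call is a shell command containing `git push`."""
    command = payload.get("tool_input", {}).get("command", "")
    return "git push" in command


def run_command(cmd: Sequence[str]) -> int:
    """Run one gate and return its exit code."""
    return subprocess.run(cmd).returncode


def failed_gates(
    gates: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run every gate and return the ones that exited non-zero."""
    return [" ".join(cmd) for cmd in gates if run(cmd) != 0]


def main() -> int:
    payload = json.load(sys.stdin)
    if not is_git_push(payload):
        return 0  # not a push, let the tool call through
    failures = failed_gates(GATES)
    if failures:
        print("push blocked, failing gates: " + ", ".join(failures), file=sys.stderr)
        return 2  # assumption: this exit code tells the harness to block the call
    return 0
```

Ending the hook script with `sys.exit(main())` wires it up. The `run` parameter exists so the gate logic can be tested with a fake runner instead of real builds.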
Failure 7: Cross-repo assumptions without verification
Version: Conductor v8.0 (March 2026)
Symptom: Cross-repo refactorings failed sporadically. “Plugin A uses convention X, but plugin B uses convention Y.”
Cause: I assumed all 13 Conductor plugins used the same internal conventions. In reality, after 24 months of evolution they had long become inconsistent.
What actually happened: a new marketing-pilot plugin used a convention that had been deprecated since v6. But nothing had explicitly retired the old convention. Cross-plugin calls failed unpredictably.
Fix in v8.1.0: cross-repo convention check as part of the hooks stage. This is only a partial fix so far. The full solution comes with v9, since this is a systematic problem that needs more tooling.
Lesson: Implicit conventions in multi-plugin systems rot. Without explicit convention definitions (versioned, with migration path), you will lose to cross-plugin bugs. Microservices discipline applies here 1:1.
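Versioned conventions with an explicit retirement line can be sketched like this. The data model is an assumption for illustration (Conductor's actual check is not shown in this post): each convention has a current version and a minimum supported version, and each plugin declares which version it uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Convention:
    name: str
    current: int        # version all plugins should be on
    min_supported: int  # anything below this is retired


def check_plugins(
    conventions: Dict[str, Convention],
    declared: Dict[str, Dict[str, int]],  # plugin name -> {convention name: version}
) -> List[str]:
    """Return one message per plugin/convention pair that is unknown, retired, or deprecated."""
    problems = []
    for plugin, uses in sorted(declared.items()):
        for name, version in sorted(uses.items()):
            conv = conventions.get(name)
            if conv is None:
                problems.append(f"{plugin}: unknown convention {name!r}")
            elif version < conv.min_supported:
                problems.append(
                    f"{plugin}: {name} v{version} is retired (min v{conv.min_supported})"
                )
            elif version < conv.current:
                problems.append(
                    f"{plugin}: {name} v{version} is deprecated, migrate to v{conv.current}"
                )
    return problems
```

The distinction between "deprecated" and "retired" is what was missing in the failure above: a deprecated version still works but carries a migration target, while a retired one fails the check outright.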
Inline image placeholder: lessons map of the seven failures. Asset pending.
What the best-practice posts leave out
Before this post I reviewed the ten most-read “Agentic Coding Best Practices” articles (Anthropic Engineering, Google Cloud, McKinsey QuantumBlack, Maven, GoDaddy, timdeschryver.dev, MindStudio, plus Reddit threads). Seven of those ten contain no failure reports. Three have a brief sentence on the topic.
That is the discourse gap. Best practices are only half as instructive as the failures they were condensed from. What you read above are the failures. The distilled best practice reads: separate plugins, scoped contexts, hooks before push, precise skill descriptions, one-shot subagents, no MCP for cacheable data, explicit cross-repo conventions. But the distilled form alone does not tell you why it matters.
Where this goes next
Conductor v9 will primarily address two of these failures systematically, plus one new class of failures:
- Cross-repo convention versioning (Failure 7) — with migration paths for deprecated patterns
- Subagent state passing as a first-class concept (Failure 5) — no more iterative-subagent patterns needed
- Multi-LLM orchestration — a new class of failures that has not yet appeared but will when Codex/Kimi are called for specific sub-tasks
If you build an agentic-coding stack yourself and are stuck on any of the failures above: reach me on LinkedIn or GitHub.
FAQ
Which of the seven failures cost the most time?
Failure 2 (smart-router v5). 2 weeks of build time, 1 week of rollback pain, plus ~3 weeks of mental reset until I could actually approach v6 fresh. Total: ~6 weeks.
Should I attempt agentic coding at all if so many failures are possible?
Yes, but start smaller than I did. v1 with mega-plugin was hubris. Start with 1-2 custom commands, escalate to skills for repeated workflows, build plugins only after ~6 months of productive use. That avoids 5 of the 7 failures.
Are these failures Anthropic-specific or do they apply to Codex / Cursor?
Failures 1, 2, 5, 6 apply across LLM agents. Failures 3, 4, 7 are Claude-Code-specific because of MCP/Skills/Plugin mechanics. If you are on Cursor or Codex, your system has its own structural failures that I do not know.
When is a Conductor rollback (like v5) the right call?
When, after 2 weeks of measured use, it is worse than the previous version. Do not wait until it is “objectively bad”; by then you have lost 4-6 weeks. My rule of thumb: if I check the data after 2 weeks and the output is measurably worse, I roll back within that same week.
Which failure was the most embarrassing?
Failure 6 (no hooks before quality gates). For 18 months I thought I could compensate manually. ADHD never forgave me for that. Production bugs in 6 of 18 months — all from the same mechanism. Should have been fixed right after v1.
Written on May 17, 2026 in Hamburg. If you find this post useful, link to it — failure reports are underrepresented in the agentic-coding discourse.