The 5 Levels of AI Agentic Software Development (2026)
The first rule of agentic AI software engineering: humans must not write code.
Second rule: humans must not review code.
That opening got 43,000 impressions when I posted it on LinkedIn. Not because it's provocative for the sake of it — but because it names a future that most engineers can feel arriving but haven't been able to articulate.
Here's the expanded version, with the data, the framework, the real-world examples, and practical guidance for figuring out where you actually are — and how to move up.
The Perception Gap That's Holding You Back
A rigorous 2025 study found that experienced developers using AI tools took 19% longer to complete tasks than developers working without them. But here's what should really concern you: the developers believed AI made them 24% faster.
They were wrong not just about the magnitude of the change, but about its direction.
This perception gap is where careers get stuck. You think you're being productive. The data says otherwise. And it's not because the tools are bad. It's because the tools are being used inside workflows that weren't designed for them.
Meanwhile, at the other end of the spectrum: StrongDM runs a software factory with three engineers where no one writes code and no one reviews code. The system takes specifications, builds software, tests against behavioural scenarios, and ships it autonomously. Humans write specs and evaluate outcomes. Machines do everything in between.
At Anthropic, 90% of Claude Code's codebase was written by Claude Code itself. Boris Cherny, who leads the project, hasn't personally written code in months. OpenAI's Codex 5.3 was instrumental in creating itself — earlier builds analysed training logs, flagged failing tests, and suggested fixes to training scripts.
The gap between "AI makes me a bit faster at typing" and "AI builds our software autonomously" isn't a technology gap. It's a maturity gap. And there are exactly five levels to it.
The Five Levels
Level 0: Spicy Autocomplete
GitHub Copilot in its original form. Tab-complete on steroids. The AI predicts the next few lines and you accept or reject. It's faster than typing, but it doesn't change how you think about building software.
Impact: marginal speed increase on boilerplate. No change to architecture, planning, or workflow.
Level 1: Coding Intern
You hand discrete tasks to AI and review everything that comes back. "Write a function that validates email addresses." "Create a React component for a user profile card." The AI does the task. You read every line.
This is where most developers start with ChatGPT or Claude. It works. It's genuinely useful. But you're still the bottleneck — you're just outsourcing the typing to an intern you don't fully trust.
Level 2: Junior Developer
The AI handles multi-file changes and navigates codebases. You give it a feature description, it creates or modifies multiple files, and the result mostly works. But you're still reading all the code.
Here's the reality check: in my experience, roughly 90% of developers who say they're "AI-native" actually operate at Level 2. They think they're further along than they are. The perception gap from the study? This is where it lives.
Level 2 is also where the J-curve hits hardest. You've added AI to your existing workflow without changing the workflow itself. You're running a new engine on an old transmission, and the gears are grinding. This is why the study found developers getting slower — they're at Level 2, doing all the same reviews and checks they always did, plus the overhead of managing the AI interaction.
Level 3: Engineering Manager
This is where the psychology gets hard. At Level 3, you stop writing code and start directing AI agents. You review features at the PR level instead of reading every line. You're managing the AI the way a senior engineer manages a team of juniors — setting direction, reviewing outcomes, catching architectural mistakes.
Most people top out here because of the psychological difficulty of letting go of the code. Engineers identify as people who write code. Becoming someone who directs code-writing agents feels like a demotion, even when it's objectively a promotion in terms of output and impact.
The Spec → Phases → Plan → Implement methodology I teach in my workshops is specifically designed to make Level 3 work reliably. The framework gives you confidence that the AI's output will be good without needing to read every line — because the specification, the plan, and the guardrails constrain the solution space.
Level 4: Product Manager
You write a specification. You leave. You come back hours later to check if tests pass. You're not reading code anymore, just evaluating outcomes.
This requires two things most organisations don't have:
Specifications rigorous enough for AI agents to implement correctly without human intervention. Most orgs have never needed this level of spec quality because the humans on the other end could fill gaps with judgment, context, or a quick Slack message asking for clarification. AI agents build exactly what you described. If what you described was ambiguous, you get software that fills gaps with guesses — not customer-centric judgment.
Testing infrastructure that validates behaviour, not just code correctness. Traditional test suites check that functions return expected values. Level 4 requires tests that check whether the software actually does what users need. This is a fundamentally different thing.
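A minimal sketch of that distinction, using a hypothetical password-reset feature (all names here are illustrative, not from any real codebase): the correctness test asserts a return value; the behavioural test drives the whole flow the way a user would and asserts on the outcome the user actually cares about.

```python
# Hypothetical password-reset feature, reduced to toy functions.

def make_reset_token(user_id: str) -> str:
    """Toy token generator, for illustration only."""
    return f"reset-{user_id}"

def send_email(inbox: list, to: str, body: str) -> None:
    """Stand-in for an email service; appends to an in-memory inbox."""
    inbox.append({"to": to, "body": body})

def request_password_reset(inbox: list, user_id: str, email: str) -> None:
    """The full user-facing flow: generate a token, email a reset link."""
    token = make_reset_token(user_id)
    send_email(inbox, email, f"Your reset link: https://example.test/reset/{token}")

# Correctness-style test: does the function return the expected value?
assert make_reset_token("u1") == "reset-u1"

# Behaviour-style test: did the user actually receive a usable reset link?
inbox: list = []
request_password_reset(inbox, "u1", "a@example.test")
assert any(m["to"] == "a@example.test" and "reset-u1" in m["body"] for m in inbox)
```

The first assertion can pass while the feature is broken end to end; the second only passes if the user-visible outcome happened. Level 4 infrastructure is built from the second kind.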
This is where PRD-driven development becomes non-negotiable. The quality of your spec is the bottleneck, not the quality of the AI.
Level 5: The Dark Factory
Specifications go in. Working software comes out. No human writes or reviews code.
Almost nobody operates here. But it's where the industry is heading, and real companies are doing it today.
What Level 5 Actually Looks Like: The StrongDM Architecture
StrongDM's software factory is the best public example of Level 5 in production. Their architecture reveals what it actually takes:
Behavioural scenarios instead of tests. They don't use traditional software tests. They use scenarios — behavioural specifications stored separately so the AI agent can't see them during development. It's the same concept as holdout sets in machine learning to prevent overfitting. The agent builds software. Scenarios evaluate whether it works. But the agent never sees the evaluation criteria, so it can't game the system.
Probabilistic satisfaction instead of boolean success. They moved from "tests pass/fail" to "of all observed trajectories through all scenarios, what fraction actually satisfies the user?" This mimics aggressive external QA testing — expensive but highly effective in traditional software — without needing the QA team.
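The scoring shift can be sketched in a few lines, with made-up scenario names and outcomes (not StrongDM's actual data): instead of a single pass/fail bit, run each scenario many times and compute what fraction of all observed trajectories satisfied the user.

```python
# Hypothetical trajectories: each scenario is run repeatedly, and each
# run either satisfied the user's goal (True) or didn't (False).
trajectories = {
    "login_with_sso": [True, True, True, False, True],
    "rotate_api_key": [True, True, True, True, True],
    "revoke_access":  [False, True, True, True, False],
}

def satisfaction_rate(runs: dict) -> float:
    """Fraction of all observed trajectories, across all scenarios,
    that satisfied the user — a probabilistic score, not a boolean."""
    outcomes = [ok for scenario in runs.values() for ok in scenario]
    return sum(outcomes) / len(outcomes)

rate = satisfaction_rate(trajectories)
print(f"{rate:.0%} of trajectories satisfied the user")  # 80%
```

A boolean suite would report this build as "failing"; the probabilistic view reports 80% satisfaction and lets you decide whether that clears the bar, and where the remaining 20% concentrates.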
Digital twin universe. Behavioural clones of every external service the software interacts with. Simulated Okta, Jira, Slack, Google Workspace. How do you clone these services? Dump their full public API documentation into coding agents and have them build Go binaries that replicate the APIs, then add simplified UIs over the top.
Suddenly you can test against rate-limit-free clones at volumes exceeding production limits, exercise failure modes that would be dangerous against live services, and run thousands of scenarios per hour without API costs.
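StrongDM's twins are Go binaries generated from public API docs; as an in-process sketch of the same idea (hypothetical endpoints, not any real service's API), here is a tiny chat-service twin with no rate limits and an injectable failure mode:

```python
# Minimal in-process "digital twin" of a chat-style API.
# Hypothetical interface — the real twins replicate full public APIs.

class FakeChatAPI:
    def __init__(self, fail_rate: float = 0.0):
        self.messages: list = []
        self.fail_rate = fail_rate  # injectable failure mode, safe to test
        self._calls = 0

    def post_message(self, channel: str, text: str) -> dict:
        self._calls += 1
        # Deterministic fault injection: fail every Nth call if configured.
        if self.fail_rate and self._calls % int(1 / self.fail_rate) == 0:
            return {"ok": False, "error": "internal_error"}
        self.messages.append({"channel": channel, "text": text})
        return {"ok": True}  # note: no rate limiting, unlike a live service

# Thousands of scenario runs per hour are cheap against the twin:
twin = FakeChatAPI(fail_rate=0.25)
results = [twin.post_message("#ops", f"alert {i}") for i in range(100)]
assert sum(r["ok"] for r in results) == 75  # 25% injected failures
```

The point isn't fidelity to one API; it's that failure injection and extreme volume become routine test inputs instead of production incidents.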
Their philosophy: if you haven't spent $1,000 per human engineer per day on compute, your software factory has room for improvement. That's not a joke: at that spend, agents run at volumes where compute becomes a meaningful line item, yet it's often still cheaper than the engineering hours it replaces.
They identified the inflection point early: October 2024's Claude 3.5 Sonnet upgrade, when long-horizon agentic coding started compounding correctness rather than error. By December 2024 the model's performance was unmistakable. They founded the team in July 2025 based on that insight.
Why Most Companies Are Stuck
They bolted AI onto existing workflows, and productivity dipped before improving: the classic J-curve adoption pattern. New engine, old transmission, grinding gears.
The organisations seeing 25-30% productivity gains didn't just install tools and call it done. They redesigned their entire development workflow around AI capabilities — how they write specs, how they review code, how they structure CI/CD pipelines, what they expect from engineers at different levels.
End-to-end transformation is hard, politically contentious, and expensive. Most companies don't have the stomach for it, which is why most companies are stuck at the bottom of the J-curve.
The Skills Gap Nobody Talks About
The bottleneck moved from implementation speed to spec quality.
If you can't write a specification detailed enough for an AI agent to implement correctly without human intervention, you can't operate at Level 4 or 5. Most organisations have never needed that level of rigorous systems thinking because humans on the other end could fill gaps with judgment and context.
This is the real skills gap in 2026. Not "can you use AI tools?" — everyone can use AI tools. The gap is "can you think precisely enough to direct AI agents that execute literally?"
It's the difference between telling a human colleague "build the auth system" (they'll figure it out) and writing a specification that an AI agent can implement without ambiguity (every decision must be explicit).
How to Move Up
Here's my practical advice for moving between levels:
Level 0 → 1: Start using Claude Code or Cursor for discrete tasks. Build the habit of delegating implementation.
Level 1 → 2: Give the AI larger tasks. Features, not functions. Let it make multi-file changes. But keep reviewing everything — this level is about building trust in the tool.
Level 2 → 3: This is the hardest transition. Start with the Spec → Phases → Plan → Implement methodology. Write specs first, have the AI plan, review the plan instead of the code. Use subagents and worktrees for parallel work. Gradually shift your review from "reading every line" to "evaluating at the feature level."
Level 3 → 4: Invest in specification quality. Learn to write PRDs that are rigorous enough for autonomous implementation. Build testing infrastructure that validates behaviour, not just code correctness. Start measuring outcomes (does it work for users?) instead of outputs (does the code look right?).
Level 4 → 5: This requires organisational commitment, not just individual skill. Behavioural scenario testing, digital twin infrastructure, and a willingness to spend on compute instead of engineering hours. Most companies aren't here yet, and that's fine — Level 4 is already transformative.
The Question
Which level are you actually operating at?
Not where you think you are. Where the evidence shows you are.
Because the gap between perception and reality is where careers get stuck, where companies fall behind, and where the future gets built by someone else.
If you want to move up the levels with structured guidance, I run workshops that take engineers from Level 1-2 to Level 3-4 with hands-on practice on real projects. The methodology works because it addresses the psychological and process barriers, not just the technical ones.
Dominik Fretz | Anthropic Claude Community Ambassador | LinkedIn
Want to discuss agentic AI engineering?
I help engineering teams adopt AI without creating tomorrow's legacy nightmare.
Book a Discovery Call