Engineering Leadership in the AI Era: What Changed

Vaibhav Verma
15 min read
Tags: engineering leadership, AI, team management, engineering metrics, hiring, code review

I've been managing engineering teams for 14 years. The last two have felt like a completely different job.

Not because AI wrote some code for me. Because the entire operating model of engineering leadership shifted underneath my feet. The way I hire, the way I measure output, the way I think about team structure, the way I evaluate architecture decisions, the way I communicate with the board. All of it changed.

Most of what you'll read about "AI and engineering leadership" is surface-level stuff about Copilot adoption rates. That's table stakes. The real changes are structural, and if you're leading an engineering org right now without rethinking your fundamentals, you're already behind.

Here's what actually changed and what I've done about it.

The Productivity Illusion

The first thing every CEO asks: "Are your engineers 10x more productive with AI?"

No. They're not. And if your CTO tells you they are, they're either lying or measuring the wrong things.

Here's what I've actually seen across three teams (totaling about 180 engineers):

  • Lines of code per engineer per week went up 40-60%. That sounds great until you realize lines of code is a garbage metric. Always has been.
  • Time to first PR on a new codebase dropped by about 35%. This one matters. Onboarding got genuinely faster.
  • Bug density stayed roughly flat. More code, same percentage of bugs, which means more total bugs. We had to invest more in testing and review.
  • Architecture quality actually declined initially. Engineers accepted AI suggestions that worked locally but created systemic problems. More on this below.

The productivity gains are real but they're not evenly distributed. Senior engineers who already knew what good code looked like got faster. Junior engineers produced more code but not necessarily better code. The gap between "writes code" and "solves problems" got wider.

The Contrarian Take: AI didn't make engineering leadership easier. It made the difference between good and bad engineering leadership more visible, faster. Bad teams ship bad code faster now. Good teams ship good code faster. The delta between them grew.

The New Hiring Equation

I used to hire for three things: raw problem-solving ability, domain knowledge, and cultural fit. That formula is broken now.

Here's my updated framework.

The CAPE Hiring Framework

C - Contextual Judgment: Can this person look at an AI-generated solution and know whether it's actually good? This requires deep understanding of system design, not just syntax. I test this by giving candidates an AI-generated PR with three subtle architectural problems and asking them to review it. The best candidates spot all three in under 15 minutes.

A - Architectural Thinking: Can they design systems, not just build features? AI is great at implementing well-defined tasks. It's terrible at deciding what to build and how components should interact. I weight system design interviews 40% heavier than I did two years ago.

P - Product Instinct: Can they translate user problems into technical solutions without a spec? As AI handles more implementation, the engineers who thrive are the ones who understand why they're building what they're building. I now include a product case study in every engineering interview loop.

E - Editing Over Authoring: Can they refine, improve, and debug AI-generated code efficiently? This is a genuinely new skill. Some engineers are brilliant authors but terrible editors. I've added pair-debugging sessions where candidates work through AI-generated code with subtle bugs.

The result: my last three hires were engineers who, on paper, looked less impressive than other candidates. Fewer years of experience, fewer side projects. But their judgment was better. They could tell good code from plausible code. That distinction matters more now than ever.

Team Structure in the Age of AI

I ran the classic pod model for years. Cross-functional teams of 5-8 engineers, each owning a domain. It worked well.

Then AI tools arrived and something weird happened. Individual engineers could cover more ground. The optimal team size started shrinking. But the need for coordination and review increased because AI-generated code requires more oversight, not less.

The Structure I've Landed On

Core Pods (3-4 engineers each):

  • Smaller than before. Each engineer covers more surface area because AI accelerates individual output.
  • Every pod has at least one senior engineer whose primary job is review and architecture. Not writing code. Reviewing it.

Architecture Guild (2-3 senior engineers, rotating):

  • Meets twice a week. Reviews cross-cutting concerns.
  • Owns the "AI interaction patterns" document that defines how the org uses AI tools.
  • Has veto power on architectural decisions, which they use about once a month.

Quality Engineering Team (dedicated):

  • This team didn't exist two years ago. It exists now because AI-generated code requires different testing strategies.
  • They own the integration test suite, performance benchmarks, and security scanning pipeline.
  • They're not QA in the traditional sense. They're engineers who specialize in catching the kinds of mistakes AI tools make.

The key insight: AI makes individual engineers more productive but makes coordination harder. You need fewer hands writing code and more hands ensuring the code works together.

What Happened to Code Review

Code review used to be primarily about catching bugs and sharing knowledge. Now it's the most important process in the entire engineering org.

Here's why: when an engineer writes code from scratch, the code reflects their understanding (or misunderstanding) of the system. You can see their thought process. When AI generates code and an engineer approves it, you can't tell whether the engineer understood what they approved or just saw that the tests passed.

My Updated Code Review Protocol

  1. Every PR must include a "Why" section. Not what the code does. Why this approach was chosen over alternatives. If the engineer used AI to generate the code, they need to explain what constraints they gave the AI and what alternatives they considered.

  2. Diff size limits got smaller, not bigger. We went from a soft limit of 400 lines per PR to 250. AI makes it easy to generate massive PRs. Massive PRs get rubber-stamped. (A sketch of a CI check for this follows the list.)

  3. Mandatory architecture review for anything touching more than two services. This catches the cross-cutting problems that AI tools consistently miss.

  4. Weekly "code archaeology" sessions. A senior engineer picks a recent PR and walks through it line by line with the team. Not to criticize. To teach. The goal is building the judgment muscle that lets engineers evaluate AI output.

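For the diff size limit, a lightweight CI gate is enough to keep everyone honest. Here's a minimal sketch in Python; the `origin/main` base ref, the hard-fail behavior, and the exact threshold are illustrative assumptions, not a description of our internal tooling.

```python
#!/usr/bin/env python3
"""PR size gate: fail the build when a diff blows past the soft limit."""
import subprocess
import sys

MAX_CHANGED_LINES = 250   # our soft limit; tune for your repo
BASE_REF = "origin/main"  # assumption: PRs merge into main

def changed_lines(base_ref: str) -> int:
    """Count added + deleted lines between the merge base and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added.isdigit() and deleted.isdigit():  # binary files report "-"
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    total = changed_lines(BASE_REF)
    if total > MAX_CHANGED_LINES:
        print(f"This PR changes {total} lines (soft limit: {MAX_CHANGED_LINES}). Split it.")
        sys.exit(1)
    print(f"PR size OK: {total} changed lines.")
```

Whether you hard-fail or just warn is a team call. The point is that an oversized PR triggers a conversation instead of a rubber stamp.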
Our defect escape rate (bugs that reach production) dropped 23% after implementing these changes. Not because we write less code, but because we catch more problems before they ship.

Measuring What Matters Now

I threw out half my engineering metrics last year. Here's what I kept and what I added.

Metrics I Kept

  • Deployment frequency: Still matters. Are we shipping?
  • Change failure rate: Still matters. Are we breaking things?
  • Time to restore service: Still matters. How fast do we recover?

Metrics I Dropped

  • Lines of code: Always bad. Now meaningless.
  • Story points completed: Velocity became unreliable when AI made some tasks trivially fast and didn't help with others at all.
  • Individual PR count: Gameable. Always was, but now even more so.

Metrics I Added

  • Review depth score: A composite metric based on time spent in review, comments per PR, and architectural questions raised. Low scores correlate strongly with escaped defects.
  • Decision documentation rate: What percentage of architectural decisions have written ADRs? This measures whether the team is thinking, not just producing.
  • AI-assist ratio by complexity tier: We categorize tasks into three tiers. For Tier 1 (straightforward implementation), high AI usage is fine. For Tier 3 (complex architectural work), high AI usage is a red flag. This ratio tells me whether engineers are using AI appropriately.
  • Rework rate: What percentage of shipped code gets modified within 30 days of deployment? AI-heavy code has a 2.3x higher rework rate in our data. We're working on bringing that down.
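If you want to approximate rework rate yourself, plain git history gets you most of the way. The sketch below is a file-level proxy under assumed dates and repo layout, not our exact pipeline: it measures the share of files shipped on a given day that get touched again within 30 days.

```python
"""Rough file-level proxy for rework rate: the share of files shipped on a
given day that are modified again within the following 30 days."""
import subprocess
from datetime import date, timedelta

def files_changed(since: str, until: str) -> set[str]:
    """Paths touched by commits in [since, until] on the current branch."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", f"--until={until}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line.strip()}

def rework_rate(ship_day: str, window_days: int = 30) -> float:
    """Fraction of files shipped on ship_day that changed again in the window."""
    shipped = files_changed(ship_day + " 00:00", ship_day + " 23:59")
    start = (date.fromisoformat(ship_day) + timedelta(days=1)).isoformat()
    end = (date.fromisoformat(ship_day) + timedelta(days=window_days)).isoformat()
    reworked = shipped & files_changed(start, end)
    return len(reworked) / len(shipped) if shipped else 0.0

if __name__ == "__main__":
    # Hypothetical deployment date; swap in your own release dates.
    print(f"30-day rework rate: {rework_rate('2025-01-15'):.0%}")
```

Run it per deployment date and trend the number. The absolute value matters less than the direction.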

Communicating with the Board

Every board meeting, someone asks about AI. Here's my framework for having that conversation without overselling or underselling.

The Three Buckets

Bucket 1: Where AI is genuinely saving us money. I'm specific. "AI coding tools reduced time-to-first-commit for new hires by 35%, which saves us approximately $40K per engineer in onboarding costs. With 30 hires this year, that's $1.2M." Real numbers. Real impact.

Bucket 2: Where AI is changing what we build. "We shipped a natural language search feature in 6 weeks that would have taken 4 months without LLM APIs. That feature drove a 12% increase in user engagement." Show the product impact, not just the engineering efficiency.

Bucket 3: Where AI requires investment. "We're spending an additional $180K/year on AI tooling licenses, $240K on the quality engineering team we built to manage AI-generated code quality, and about $90K on increased compute for our expanded test suite." Don't hide the costs. Show the ROI.
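Before the meeting, I run the numbers side by side. A back-of-the-envelope consolidation of the figures quoted above (Bucket 2's product impact isn't dollarized here):

```python
# Back-of-the-envelope consolidation of the figures above.
onboarding_savings = 40_000 * 30            # $40K per hire x 30 hires = $1.2M
ai_costs = 180_000 + 240_000 + 90_000       # tooling + quality eng team + compute = $510K
net = onboarding_savings - ai_costs         # $690K before any product-side impact

print(f"Onboarding savings: ${onboarding_savings:,}")
print(f"AI-related spend:   ${ai_costs:,}")
print(f"Net (Bucket 1 - 3): ${net:,}")
```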

This framework works because it's honest. Boards can smell hype. Give them numbers.

The Architecture Problem Nobody Talks About

Here's the thing that keeps me up at night: AI tools are creating a new kind of technical debt that's harder to see and harder to fix.

I call it "plausible debt." The code works. The tests pass. The PR looks reasonable. But the architectural decisions embedded in that code are subtly wrong. The service boundaries are in the wrong place. The data model doesn't account for a use case that's coming in Q3. The caching strategy works for current load but won't scale.

A human engineer making these decisions would have context. They'd know about the Q3 roadmap. They'd remember that last time we put the boundary there, we had to refactor. AI doesn't have that context.

How I'm Fighting Plausible Debt

  1. Quarterly architecture audits. Not just "does this code work?" but "does this architecture serve our next 18 months?"

  2. Decision logs. Every significant architectural choice gets logged with context, alternatives considered, and rationale. When AI generates code that implicitly makes an architectural choice, someone has to make that choice explicit.

  3. Refactoring budget. I allocate 20% of every sprint to addressing architectural issues. Not feature work. Not bug fixes. Architecture. This is up from 10% pre-AI.

  4. Cross-team architecture reviews. Monthly sessions where teams present their architecture to other teams. The outsider perspective catches drift that insiders are too close to see.

What I Got Wrong

I want to be honest about my mistakes.

Mistake 1: I waited too long to update the interview process. We hired three engineers using the old process before I realized we were optimizing for the wrong skills. Two of them are great. One is struggling because they can't evaluate AI output effectively.

Mistake 2: I underestimated the testing impact. I assumed AI-generated code would need the same testing strategy as human-written code. It doesn't. AI makes different kinds of mistakes than humans do. More edge case failures, more integration issues, fewer syntax errors. Our test suite wasn't designed for that.

Mistake 3: I tried to measure AI's impact with pre-AI metrics. For three months, I kept reporting the same DORA metrics and wondering why the story wasn't making sense. The old metrics weren't wrong, but they were incomplete.

The Framework: AI-Era Engineering Leadership

Here's the stealable framework I've landed on. I call it the 4R Model.

The 4R Model

Review: Invest disproportionately in code review. It's your quality firewall. Train engineers to review AI output the way you'd review a junior engineer's code.

Rethink: Rethink your team structure, metrics, and hiring for a world where code generation is cheap and judgment is expensive.

Require: Require documentation of decisions, not just code. When AI generates the implementation, the thinking behind the design is the only thing that prevents drift.

Resist: Resist the pressure to show impossible productivity gains. Be honest about costs, risks, and timelines. Your credibility as a leader depends on it.

Looking Forward

Engineering leadership in the AI era isn't about adopting tools. It's about maintaining quality, judgment, and intentionality when the cost of producing code approaches zero.

The leaders who thrive will be the ones who understand that cheaper code doesn't mean cheaper engineering. It means the hard parts of engineering (the thinking, the architecture, the coordination) matter even more than they used to.

I don't have all the answers. Nobody does right now. But I know this: if you're leading an engineering org the same way you did in 2023, you're falling behind. Not because of the tools. Because the game changed, and the old playbook doesn't account for it.

Start with the 4R Model. Audit your metrics. Redesign your review process. And be honest with your board about what's working and what isn't.

That's the job now.
