AI Pair Programming: What Works and What Doesn't

Vaibhav Verma
10 min read
Tags: ai, pair-programming, productivity, code-quality, developer-tools, honest-review

I've been using AI as my pair programming partner for 22 months. Not occasionally. Daily. For production code, prototypes, debugging, refactoring, and everything in between. And my honest assessment is that most of the advice about AI pair programming is wrong.

The typical take goes something like: "AI is great for boilerplate, bad for complex logic." That's too simplistic. I've tracked my AI pair programming sessions for the last 8 months. 847 sessions across 6 different AI tools. The reality is much more nuanced than "good for easy stuff, bad for hard stuff."

The Data: 847 Sessions Tracked

I logged every AI pair programming session with: task type, tool used, quality of output (1-5 scale), time saved or lost versus doing it manually, and whether the AI output required significant rework.
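
Each entry is a small record. Here's a rough sketch of the shape in TypeScript, with illustrative field names:

typescript
// Rough sketch of a session log entry; field names are illustrative.
type TaskType =
  | "crud-endpoint"
  | "database-query"
  | "ui-component"
  | "business-logic"
  | "bug-fix"
  | "refactoring"
  | "test-writing"
  | "architecture"
  | "performance";

interface PairingSession {
  date: string;                       // ISO date of the session
  tool: string;                       // which AI tool I paired with
  taskType: TaskType;
  quality: 1 | 2 | 3 | 4 | 5;         // subjective rating of the output
  minutesSavedVsManual: number;       // negative when the AI slowed me down
  requiredSignificantRework: boolean;
}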

Here's the summary:

| Task type | Sessions | Avg. quality | Time impact | Rework rate |
| --- | --- | --- | --- | --- |
| CRUD endpoints | 112 | 4.2/5 | 65% faster | 12% |
| Database queries | 89 | 3.8/5 | 50% faster | 18% |
| UI components | 134 | 3.9/5 | 55% faster | 22% |
| Business logic | 97 | 2.7/5 | 10% faster | 45% |
| Bug fixing | 143 | 3.1/5 | 30% faster | 35% |
| Refactoring | 78 | 3.4/5 | 40% faster | 28% |
| Test writing | 104 | 2.9/5 | 35% faster | 41% |
| Architecture decisions | 56 | 1.8/5 | 20% slower | 68% |
| Performance optimization | 34 | 2.3/5 | 15% slower | 52% |

Two things jump out. First, AI pair programming makes you faster at almost everything, even when the quality is mediocre. That's because "faster with rework" is often still faster than "manual from scratch." Second, there are 2 categories where AI actively slows you down: architecture decisions and performance optimization.

What Actually Works

1. Test-First AI Pairing

This is my highest-value pattern. Write the test descriptions yourself, then let AI implement both the tests and the code.

typescript
// I write this (takes 5 minutes):
describe("InvoiceCalculator", () => {
  it("should apply 10% discount for annual billing");
  it("should prorate for mid-cycle upgrades based on remaining days");
  it("should refuse downgrade if usage exceeds new tier limits");
  it("should handle currency conversion for international accounts");
  it("should round to 2 decimal places after all calculations");
  it("should throw InvoiceError for negative quantities");
});

// AI implements both tests and production code (takes 10 minutes)
// I review for correctness (takes 15 minutes)
// Total: 30 minutes vs ~90 minutes fully manual

Why this works: you're providing the intent and the edge cases. AI is providing the implementation. The hard part of testing is knowing what to test. The easy part is writing the assertions. AI is great at the easy part.
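
To make that concrete, here's roughly what a filled-in version of the first spec looks like. The InvoiceCalculator API below is illustrative, not my actual billing code:

typescript
// Illustrative: one pending spec filled in with an assertion.
import { InvoiceCalculator } from "./invoice-calculator"; // hypothetical module

describe("InvoiceCalculator", () => {
  it("should apply 10% discount for annual billing", () => {
    const calculator = new InvoiceCalculator();
    const invoice = calculator.calculate({
      unitPrice: 100,
      quantity: 12,
      billingCycle: "annual",
    });
    // 12 * 100 = 1200, minus the 10% annual discount
    expect(invoice.total).toBe(1080);
  });
});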

Over 104 test-writing sessions, test-first pairing had a 22% rework rate versus 58% when I let AI write tests after implementation. The quality difference is dramatic.

2. Rubber Duck Debugging With AI

This surprised me. AI isn't great at directly fixing bugs. But it's excellent as a rubber duck that talks back.

My debugging workflow:

  1. Describe the bug to the AI in detail
  2. Share the relevant code and the error
  3. Ask it for 5 possible causes (not the fix, the causes)
  4. Evaluate each cause myself
  5. Ask for fix suggestions only after I've identified the root cause

When I asked AI to "fix this bug" directly, the fix was wrong 52% of the time. When I used the structured debugging approach, I found the root cause 78% of the time, and the subsequent fix was correct 89% of the time.

The difference is that asking AI to fix a bug gives it a needle-in-a-haystack problem. Asking it to brainstorm causes gives it a pattern-matching problem, which is what language models are actually good at.
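
In practice, the cause-brainstorming step (step 3 above) is a prompt shaped roughly like this:

Here's the error and the handler that throws it. Don't suggest a fix yet.
List 5 plausible root causes, ordered by likelihood, and for each one tell
me what evidence in the code or logs would confirm or rule it out.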

3. Refactoring With Constraints

AI refactoring works well when you give explicit constraints:

Refactor this function with these constraints:
- Keep the same public API (same parameters, same return type)
- Extract the validation logic into a separate function
- Replace the nested if/else with early returns
- Don't change the error messages (they're used in monitoring)
- Keep the database call at the same position in the flow

Without constraints, AI refactoring tends to change the architecture. With constraints, it produces focused improvements. My refactoring sessions with explicit constraints had a 15% rework rate. Without constraints: 44%.
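
To show the kind of focused change those constraints produce, here's a small made-up before/after in the same spirit (the function and the db helper are illustrative):

typescript
// Illustrative before/after. Same public API, validation extracted,
// nested if/else replaced with early returns, error messages untouched.
interface Credit { accountId: string; amount: number }
declare const db: { credits: { insert(c: Credit): Promise<Credit> } };

// Before
function applyCreditOld(accountId: string, amount: number): Promise<Credit> {
  if (accountId) {
    if (amount > 0) {
      return db.credits.insert({ accountId, amount });
    } else {
      throw new Error("Credit amount must be positive");
    }
  } else {
    throw new Error("Account id is required");
  }
}

// After
function validateCreditInput(accountId: string, amount: number): void {
  if (!accountId) throw new Error("Account id is required");
  if (amount <= 0) throw new Error("Credit amount must be positive");
}

function applyCredit(accountId: string, amount: number): Promise<Credit> {
  validateCreditInput(accountId, amount);
  return db.credits.insert({ accountId, amount });
}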

4. Documentation Generation From Code

This is the one area where AI quality is consistently high. Give it a function, ask for JSDoc comments and a usage example, and the output is usable 90%+ of the time. I don't track this formally anymore because it just works.
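
As a concrete example, give it a small utility like the one below and ask for JSDoc plus a usage example; the output is typically in this shape (the function itself is illustrative):

typescript
/**
 * Formats a duration in milliseconds as a human-readable string.
 *
 * @param ms - Duration in milliseconds; expected to be non-negative.
 * @returns A string like "2m 05s", "1h 03m", or "340ms" for sub-second values.
 *
 * @example
 * formatDuration(125_000); // "2m 05s"
 */
function formatDuration(ms: number): string {
  if (ms < 1000) return `${ms}ms`;
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  if (minutes < 60) return `${minutes}m ${String(seconds).padStart(2, "0")}s`;
  const hours = Math.floor(minutes / 60);
  return `${hours}h ${String(minutes % 60).padStart(2, "0")}m`;
}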

What Doesn't Work

1. Architecture Decisions

I have 56 sessions logged where I asked AI for architecture guidance. The average quality was 1.8/5. Here's why: AI gives you the most common architecture for your description. But good architecture depends on your specific constraints, team size, expected scale, and business context. AI doesn't know any of that.

Worst case I saw: I asked for a recommendation on event-driven versus request-response for a microservice communication pattern. AI recommended event-driven with Kafka. My team is 4 people with a monorepo. We don't need Kafka. We need direct function calls. The AI's suggestion would have added months of infrastructure complexity.

2. Performance Optimization

AI consistently suggests "standard" optimizations that don't address the actual bottleneck. It'll suggest adding indexes when your problem is N+1 queries. It'll suggest caching when your problem is a missing database connection pool limit. It'll suggest memoization when your problem is synchronous file I/O blocking the event loop.
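
To be concrete about the N+1 case: the problem is a query inside a loop, and no index fixes the extra round trips. The db helper below is illustrative:

typescript
// Illustrative N+1 pattern; `db` is a made-up query helper.
declare const db: { query<T>(sql: string, params?: unknown[]): Promise<T[]> };

interface Order { id: number; customerId: number }
interface Customer { id: number; name: string }

// N+1: one query for the orders, then one query per order.
// An index makes each of the N queries faster, but you still pay N round trips.
async function loadCustomersSlow(orders: Order[]): Promise<Customer[]> {
  const customers: Customer[] = [];
  for (const order of orders) {
    const [customer] = await db.query<Customer>(
      "SELECT id, name FROM customers WHERE id = $1",
      [order.customerId]
    );
    customers.push(customer);
  }
  return customers;
}

// The actual fix: batch it into a single query.
async function loadCustomers(orders: Order[]): Promise<Customer[]> {
  const ids = [...new Set(orders.map((o) => o.customerId))];
  return db.query<Customer>(
    "SELECT id, name FROM customers WHERE id = ANY($1)",
    [ids]
  );
}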

Performance optimization requires understanding your specific runtime environment, your data patterns, and your load profile. AI has none of that context.

3. Complex Business Logic

When business rules involve multiple conditional paths, edge cases from real-world usage, and interactions between subsystems, AI produces code that handles the obvious cases and misses the subtle ones.

I had a billing calculation that needed to handle 4 discount types, 3 billing cycles, prorated upgrades, and currency conversion. AI got about 70% of the logic right. But the 30% it missed were the edge cases that cause billing disputes. In the end, debugging the AI output took longer than writing it manually would have.

4. Legacy Code Modifications

AI struggles with large, legacy codebases because it doesn't understand the hidden dependencies. It'll suggest changes that work in isolation but break something 3 files away. In legacy systems, the "right" approach is often counterintuitive, and AI defaults to the textbook answer.

The Contrarian Framework: The 3-Question Filter

Before every AI pair programming session, I ask 3 questions:

  1. Is the task well-defined? If I can't describe the exact input, output, and constraints in 3 sentences, AI won't produce good code. Don't use AI for exploratory work.

  2. Would I know if the output is wrong? If I can't confidently evaluate the AI's code, I shouldn't be asking AI to generate it. This is especially true for security-sensitive code and complex algorithms.

  3. Is the task pattern-based or judgment-based? Pattern-based tasks (CRUD, boilerplate, standard UI) are AI-appropriate. Judgment-based tasks (architecture, performance, business rules) are human-appropriate.

If a task passes all 3 questions, AI pair programming will likely save time. If it fails any of them, go manual.

My Current Setup

After 847 sessions across 6 tools, here's what I actually use daily:

  • Cursor for file-level edits and refactoring within existing code
  • Claude for design discussions, debugging brainstorming, and test case generation
  • Copilot for inline completions while typing

I've stopped using AI for:

  • Any decision that affects more than one service
  • Security-critical code (auth, encryption, access control)
  • Performance-sensitive hot paths
  • Anything I couldn't write myself

That last point is important. AI pair programming is most valuable when you're fast-tracking work you already know how to do. It's least valuable when you're using it as a crutch for skills you don't have.

A Year From Now

AI pair programming tools are improving fast. The areas where AI fails today will shrink. But I don't think the fundamental pattern will change: AI will remain excellent at implementation and weak at judgment. Plan your usage accordingly.

Track your own sessions. The 5 minutes it takes to log each session will give you data that saves hours of misdirected AI usage over the following months.
