The Real Impact of AI on Code Duplication
I ran jscpd on 9 production codebases last quarter. The 5 teams using AI coding assistants daily had an average duplicate code ratio of 11.3%. The 4 teams without AI had an average of 3.8%. That's a 3x difference.
That number alone doesn't tell the full story. Duplication isn't always bad. Sometimes it's the right trade-off. But 11.3% isn't intentional duplication for decoupling. It's accidental duplication from a tool that doesn't know your code already exists.
Why AI Creates More Duplicates
The explanation is straightforward. When you ask AI to write a function, it generates a fresh implementation based on your prompt and its training data. It doesn't search your codebase for an existing utility that does the same thing. It doesn't check if there's already a helper function three directories away. It writes new code every time.
Here's my contrarian take: the duplication problem isn't an AI limitation. It's a codebase discoverability problem. If your own engineers can't find existing utilities quickly, you can't expect AI to find them either. The teams with the lowest duplication ratios, both with and without AI, are the ones with the best code organization and search tooling.
The Duplication Audit: What I Found
I dug into the 5 AI-heavy codebases to categorize the duplicates. The patterns were remarkably consistent:
| Duplication Category | % of All Duplicates | Avg Instances |
|---|---|---|
| Utility functions (date formatting, string helpers) | 34% | 6.2 per function |
| API request/response handling | 22% | 4.1 per pattern |
| Validation logic | 18% | 3.8 per schema |
| Error handling patterns | 14% | 5.3 per pattern |
| Database query patterns | 12% | 3.2 per query |
The worst offender was date formatting. One codebase had 14 different implementations of "format a date as relative time" (e.g., "3 hours ago"). Each one worked. Each one was slightly different. Some handled edge cases the others didn't. Nobody knew which was the "correct" version.
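To make that drift concrete, here is a simplified, hypothetical pair of the kind of copies jscpd surfaces (the names and file paths are illustrative, not lifted from the audited codebases). Same intent, different edge-case behavior:

```typescript
// Hypothetical copy A: components/feed/time.ts
// Rounds down, no "just now" case, no handling of future dates.
export function timeAgo(date: Date): string {
  const seconds = Math.floor((Date.now() - date.getTime()) / 1000);
  if (seconds < 60) return `${seconds} seconds ago`;
  if (seconds < 3600) return `${Math.floor(seconds / 60)} minutes ago`;
  if (seconds < 86400) return `${Math.floor(seconds / 3600)} hours ago`;
  return `${Math.floor(seconds / 86400)} days ago`;
}

// Hypothetical copy B: utils/notifications.ts
// Handles "just now" and future dates, but rounds and abbreviates differently.
export function relativeTime(date: Date, now: Date = new Date()): string {
  const diffMs = now.getTime() - date.getTime();
  if (diffMs < 0) return "in the future";
  const minutes = Math.round(diffMs / 60_000);
  if (minutes < 1) return "just now";
  if (minutes < 60) return `${minutes} min ago`;
  const hours = Math.round(minutes / 60);
  if (hours < 24) return `${hours} hr ago`;
  return `${Math.round(hours / 24)} d ago`;
}
```

Both pass a casual code review. Neither is obviously the "correct" one, which is exactly the problem.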
The Hidden Cost
Duplication has obvious costs: larger bundle size, more maintenance surface. But the hidden cost is worse. When you have 6 implementations of the same utility, and one of them has a bug, which ones do you fix? How do you know you found them all?
I tracked a real incident. A date parsing function had a timezone bug. The team fixed it in the utility file. But there were 4 other copies of similar logic scattered across the codebase. The bug persisted in production for 3 weeks because nobody realized the same logic existed elsewhere.
The math on this is bad. If a bug fix takes an hour and the logic exists in 6 copies, a complete fix means finding and patching all 6, roughly 6 hours of work. But most teams fix the one copy they found and move on. The other 5 become ticking time bombs.
The Duplication Prevention Framework
After watching this pattern repeat across 5 teams, I built a 3-layer defense against AI-introduced duplication.
Layer 1: Code Atlas (Before Generation)
Before AI generates new code, check if the functionality already exists. This sounds obvious, but nobody does it systematically.
Build a "Code Atlas" - a searchable index of your utilities and shared functions:
```typescript
// scripts/build-code-atlas.ts
import { writeFileSync } from "node:fs";
import { Project } from "ts-morph";

interface FunctionEntry {
  name: string;
  file: string;
  description: string;
  parameters: string[];
  returnType: string;
}

function buildAtlas(): FunctionEntry[] {
  const project = new Project({
    tsConfigFilePath: "./tsconfig.json",
  });
  const entries: FunctionEntry[] = [];

  // Index all exported functions from utility directories
  const utilFiles = project.getSourceFiles("src/utils/**/*.ts");
  const libFiles = project.getSourceFiles("src/lib/**/*.ts");

  for (const file of [...utilFiles, ...libFiles]) {
    const functions = file.getFunctions().filter(f => f.isExported());
    for (const fn of functions) {
      entries.push({
        name: fn.getName() || "anonymous",
        file: file.getFilePath(),
        // JSDoc description text becomes the searchable summary
        description: fn.getJsDocs().map(d => d.getDescription().trim()).join(" "),
        parameters: fn.getParameters().map(p => `${p.getName()}: ${p.getType().getText()}`),
        returnType: fn.getReturnType().getText(),
      });
    }
  }
  return entries;
}

// Write the atlas to disk so it can be searched or pasted into prompt context
writeFileSync("code-atlas.json", JSON.stringify(buildAtlas(), null, 2));
```

Before prompting AI for new utility code, search your atlas. Include relevant existing functions in your prompt context. "We already have formatRelativeDate in /src/utils/dates.ts. Use it instead of creating a new one."
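A naive search over that JSON is enough to answer "does something like this already exist?" before you prompt. This is a sketch that assumes the atlas was written to code-atlas.json as above; the matching is deliberately simple substring matching on names and descriptions:

```typescript
// scripts/search-atlas.ts
import { readFileSync } from "node:fs";

interface FunctionEntry {
  name: string;
  file: string;
  description: string;
  parameters: string[];
  returnType: string;
}

// Keyword search over the generated atlas: crude, but fast enough to run before every prompt
function searchAtlas(query: string): FunctionEntry[] {
  const atlas: FunctionEntry[] = JSON.parse(readFileSync("code-atlas.json", "utf8"));
  const q = query.toLowerCase();
  return atlas.filter(
    e => e.name.toLowerCase().includes(q) || e.description.toLowerCase().includes(q)
  );
}

// Usage: npx tsx scripts/search-atlas.ts "relative date"
for (const hit of searchAtlas(process.argv[2] ?? "")) {
  console.log(`${hit.name}(${hit.parameters.join(", ")}): ${hit.returnType} in ${hit.file}`);
}
```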
Layer 2: Duplicate Detection in CI
Run duplication analysis on every PR. Flag new duplicates before they merge.
```yaml
# .github/workflows/duplication-check.yml
- name: Check for code duplication
  run: |
    # --output takes a directory; the json reporter writes jscpd-report.json into it
    npx jscpd src/ --min-lines 5 --min-tokens 40 \
      --reporters json --output ./
    # Compare with baseline
    CURRENT=$(cat jscpd-report.json | jq '.statistics.total.percentage')
    BASELINE=$(cat .metrics/duplication-baseline.json | jq '.percentage')
    if (( $(echo "$CURRENT > $BASELINE + 0.5" | bc -l) )); then
      echo "::error::Duplication increased from ${BASELINE}% to ${CURRENT}%"
      exit 1
    fi
```

Set a duplication budget. Ours is 4%. Any PR that pushes the ratio above 4% gets flagged. The team decides whether the duplication is intentional (microservice independence) or accidental (AI didn't know about existing code).
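The baseline needs to be refreshed after merges to main, or the check slowly drifts out of date. Here's a sketch of a baseline updater, assuming the same report and baseline paths as above (.metrics/duplication-baseline.json is our convention, not something jscpd requires):

```typescript
// scripts/update-duplication-baseline.ts
// Run on main after merge to record the current duplication percentage
// as the new baseline that PR checks compare against.
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";

// Only the field we need from jscpd's JSON report is typed here
interface JscpdReport {
  statistics: { total: { percentage: number } };
}

const report: JscpdReport = JSON.parse(readFileSync("jscpd-report.json", "utf8"));
const percentage = report.statistics.total.percentage;

mkdirSync(".metrics", { recursive: true });
writeFileSync(
  ".metrics/duplication-baseline.json",
  JSON.stringify({ percentage, updatedAt: new Date().toISOString() }, null, 2)
);
console.log(`Duplication baseline updated to ${percentage}%`);
```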
Layer 3: Monthly Deduplication Sprint
Once a month, spend a half-day consolidating duplicates. Use the jscpd report to find clusters of similar code, then extract shared utilities.
This isn't glamorous work. But every function you deduplicate shrinks the future maintenance surface in proportion to the copies you remove. Deduplicate a function with 6 copies and you've cut the maintenance surface for that logic by 83%.
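To pick targets for the sprint, rank the jscpd findings by how many duplicated lines each file pair shares. A rough sketch, assuming the JSON report's duplicates array with firstFile/secondFile entries (exact field names can vary between jscpd versions, so check your report first):

```typescript
// scripts/rank-duplicates.ts
// Groups jscpd findings by file pair and sorts by total duplicated lines,
// so the monthly sprint starts with the biggest clusters.
import { readFileSync } from "node:fs";

interface DuplicatePair {
  lines: number;
  firstFile: { name: string };
  secondFile: { name: string };
}

const report = JSON.parse(readFileSync("jscpd-report.json", "utf8")) as {
  duplicates: DuplicatePair[];
};

const totals = new Map<string, number>();
for (const dup of report.duplicates) {
  // Normalize the pair so A<->B and B<->A count as the same cluster
  const key = [dup.firstFile.name, dup.secondFile.name].sort().join(" <-> ");
  totals.set(key, (totals.get(key) ?? 0) + dup.lines);
}

const ranked = [...totals.entries()].sort((a, b) => b[1] - a[1]);
for (const [pair, lines] of ranked.slice(0, 20)) {
  console.log(`${lines} duplicated lines: ${pair}`);
}
```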
The "AI-Aware" Utility Organization
Most codebases organize utilities by technical concern: utils/strings.ts, utils/dates.ts, utils/arrays.ts. This organization is terrible for AI discovery. When you prompt "write a function to format a date," the AI doesn't search your utils directory.
I've started organizing utilities by use case instead:
```
src/
  utils/
    formatting/
      dates.ts      (formatRelativeDate, formatISO, formatDisplay)
      currency.ts   (formatCurrency, parseCurrencyInput)
      numbers.ts    (formatCompact, formatPercentage)
    validation/
      email.ts      (isValidEmail, normalizeEmail)
      phone.ts      (isValidPhone, formatE164)
      password.ts   (validatePasswordStrength)
    http/
      retry.ts      (withRetry, exponentialBackoff)
      errors.ts     (toApiError, isNetworkError)
```
Then I include the directory structure in every AI prompt:
```
Our utility structure:
- src/utils/formatting/ - all formatting functions
- src/utils/validation/ - all validation functions
- src/utils/http/ - HTTP helpers

Check these directories before creating new utility functions.
Reuse existing functions where possible.
```
This single change reduced new duplicate introductions by 40% in the first month. The AI still doesn't search your codebase, but by including the directory map in your prompt, you give it the information it needs to reference existing code.
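A hand-maintained directory map goes stale, so we generate it. Here's a sketch that walks src/utils and prints the prompt preamble; it lists the files in each directory rather than hand-written descriptions, and the root path is an assumption you should adjust to your layout:

```typescript
// scripts/print-prompt-context.ts
// Emits a directory map of src/utils to paste (or inject) into AI prompts,
// so generated code references existing utilities instead of recreating them.
import { readdirSync } from "node:fs";
import { join } from "node:path";

const root = "src/utils";

console.log("Our utility structure:");
for (const dir of readdirSync(root, { withFileTypes: true })) {
  if (!dir.isDirectory()) continue;
  const files = readdirSync(join(root, dir.name)).filter(f => f.endsWith(".ts"));
  console.log(`- ${root}/${dir.name}/ - ${files.join(", ")}`);
}
console.log("Check these directories before creating new utility functions.");
console.log("Reuse existing functions where possible.");
```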
Measuring Progress
Track these metrics weekly:
| Metric | Red | Yellow | Green |
|---|---|---|---|
| Overall duplication ratio | > 8% | 4-8% | < 4% |
| New duplicates per PR | > 3 | 1-2 | 0 |
| Utility functions with 3+ copies | > 10 | 3-10 | < 3 |
| Time to find existing utility | > 5 min | 2-5 min | < 2 min |
That last metric matters most. If engineers can't find existing utilities in under 2 minutes, they'll let AI generate new ones. Invest in code search tooling and documentation for your shared libraries.
The Bottom Line
AI duplication isn't a tool problem. It's an organization problem. AI generates fresh code because it doesn't know your code exists. The solution isn't better AI. It's better codebase organization and prompting discipline.
Run jscpd on your codebase today. If your duplication ratio is above 8%, you have a problem worth fixing. Start with the Code Atlas, add CI detection, and schedule your first deduplication sprint. Three months from now, your codebase will be measurably cleaner, and your AI-generated code will start fitting in instead of reinventing what's already there.