Building a Knowledge Graph of Your Codebase
Two years ago, a new engineer on my team asked me how our payment system worked. I started drawing on a whiteboard: "The API gateway calls the payment service, which talks to Stripe, and then events go to the billing service, which updates the ledger, and..." I was 10 minutes into the explanation and had filled two whiteboards. The engineer looked more confused, not less.
That's when I realized the problem. Our codebase had grown to a point where no single person could hold the full picture in their head. The architecture existed in fragments across dozens of engineers' mental models, outdated wiki pages, and tribal knowledge shared over lunch. We needed an externalized, queryable representation of how our system actually worked. We needed a knowledge graph.
What a Codebase Knowledge Graph Actually Is
A codebase knowledge graph is a structured representation of the entities in your codebase and the relationships between them. Think of it as the data model behind the whiteboard drawing, except it's generated from actual code rather than someone's memory.
Entities include:
- Modules/packages: The organizational units of your code
- Functions/classes: The building blocks
- APIs/endpoints: The boundaries
- Data models: The schemas and types
- Dependencies: Both internal and external
- Tests: And what they cover
- Engineers: And what they own
Relationships include:
- "Module A imports Module B"
- "Function X calls Function Y"
- "Endpoint /api/payments depends on Service Z"
- "Engineer Jane is the primary contributor to Module A"
- "Data model User is referenced by 47 functions across 12 modules"
// A simplified knowledge graph data structure
interface CodebaseNode {
  id: string;
  type: 'module' | 'function' | 'class' | 'endpoint' | 'model' | 'engineer';
  name: string;
  metadata: Record<string, unknown>;
}

interface CodebaseEdge {
  source: string;
  target: string;
  relationship: 'imports' | 'calls' | 'depends_on' | 'owns' | 'tests' | 'references';
  weight: number; // frequency or importance
}

interface CodebaseKnowledgeGraph {
  nodes: CodebaseNode[];
  edges: CodebaseEdge[];
}

The Contrarian Take: Static Analysis Alone Won't Cut It
Most attempts at building codebase knowledge rely on static analysis: parsing import statements, building ASTs, and tracing type references. That's necessary but insufficient. Static analysis tells you what the code could do. It doesn't tell you what the code actually does in production.
I've seen dependency graphs that show Module A imports Module B, but the actual call path in production goes A -> C -> D -> B because of runtime configuration and dependency injection. The static graph is technically correct and practically useless.
A real knowledge graph combines three data sources:
- Static analysis: What the code says (imports, types, function signatures)
- Runtime data: What actually happens (call traces, API logs, database queries)
- Human context: Why it's this way (ADRs, commit messages, ownership data)
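One way to combine the three sources is to merge their edges into a single list while recording where each edge was observed. The sketch below is a minimal illustration, not a prescribed design: the `Source`, `RawEdge`, and `MergedEdge` types and the sample data are hypothetical, and it assumes all three sources have already been normalized to the same node ids and relationship names.

```typescript
// Hypothetical shapes for illustration; none of these come from a library.
type Source = "static" | "runtime" | "human";

interface RawEdge {
  source: string;
  target: string;
  relationship: string;
}

interface MergedEdge extends RawEdge {
  seenIn: Source[]; // which data sources reported this edge
}

// Merge edges from all sources, de-duplicating on (source, relationship, target)
function mergeEdges(inputs: Record<Source, RawEdge[]>): MergedEdge[] {
  const merged = new Map<string, MergedEdge>();
  for (const origin of Object.keys(inputs) as Source[]) {
    for (const edge of inputs[origin]) {
      const key = `${edge.source}|${edge.relationship}|${edge.target}`;
      const existing = merged.get(key);
      if (existing) {
        if (!existing.seenIn.includes(origin)) existing.seenIn.push(origin);
      } else {
        merged.set(key, { ...edge, seenIn: [origin] });
      }
    }
  }
  return [...merged.values()];
}

// Edges observed in production that the static graph never predicted
function runtimeOnly(edges: MergedEdge[]): MergedEdge[] {
  return edges.filter((e) => e.seenIn.includes("runtime") && !e.seenIn.includes("static"));
}

// Sample data mirroring the A -> C -> D -> B example above
const mergedEdges = mergeEdges({
  static: [{ source: "A", target: "B", relationship: "depends_on" }],
  runtime: [
    { source: "A", target: "C", relationship: "depends_on" },
    { source: "C", target: "D", relationship: "depends_on" },
    { source: "D", target: "B", relationship: "depends_on" },
  ],
  human: [],
});
const surprises = runtimeOnly(mergedEdges);
```

Keeping the provenance on each edge means the disagreement between sources becomes queryable: edges seen only at runtime are the ones your architecture diagram is most likely missing.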
Building Your Knowledge Graph: Step by Step
Step 1: Extract Static Relationships (Week 1)
Start with what's easiest to extract: the import graph and call graph from static analysis.
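// Using the TypeScript compiler API to extract imports
import * as path from "path";
import * as ts from "typescript";

// Collects static import specifiers as written (e.g. "./foo"); dynamic import() and require() are not covered
function extractImports(sourceFile: ts.SourceFile): string[] {
  const imports: string[] = [];
  // Import declarations are top-level statements, so one pass over the file's children is enough
  ts.forEachChild(sourceFile, (node) => {
    if (ts.isImportDeclaration(node)) {
      const moduleSpecifier = node.moduleSpecifier;
      if (ts.isStringLiteral(moduleSpecifier)) {
        imports.push(moduleSpecifier.text);
      }
    }
  });
  return imports;
}

// Build the import graph for your entire codebase
function buildImportGraph(projectPath: string): Map<string, string[]> {
  // createProgram expects root file names rather than a directory,
  // so resolve the file list and compiler options from tsconfig.json
  const configPath = ts.findConfigFile(projectPath, ts.sys.fileExists);
  if (!configPath) {
    throw new Error(`No tsconfig.json found at or above ${projectPath}`);
  }
  const { config, error } = ts.readConfigFile(configPath, ts.sys.readFile);
  if (error) {
    throw new Error(ts.flattenDiagnosticMessageText(error.messageText, "\n"));
  }
  const parsed = ts.parseJsonConfigFileContent(config, ts.sys, path.dirname(configPath));
  const program = ts.createProgram(parsed.fileNames, parsed.options);

  const graph = new Map<string, string[]>();
  for (const sourceFile of program.getSourceFiles()) {
    if (!sourceFile.isDeclarationFile) {
      graph.set(sourceFile.fileName, extractImports(sourceFile));
    }
  }
  return graph;
}

For a quicker start, use existing tools: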
# Generate a dependency graph with madge
npx madge --json src/ > dependency-graph.json
# Generate a visual graph
npx madge --image graph.svg src/
# Find circular dependencies (these are high-priority graph edges)
npx madge --circular src/

Step 2: Add Runtime Relationships (Weeks 2-3)
Static analysis misses runtime behavior. Add OpenTelemetry tracing to capture actual call paths:
// Instrument key service boundaries to capture runtime relationships
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("knowledge-graph-builder");

export function instrumentedHandler<A extends unknown[], R>(
  name: string,
  handler: (...args: A) => R | Promise<R>,
) {
  return (...args: A): Promise<R> =>
    // startActiveSpan makes this span the parent of any span created inside the handler,
    // which is what links caller to callee in the resulting trace
    tracer.startActiveSpan(name, async (span) => {
      try {
        const result = await handler(...args);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.recordException(error as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    });
}

// Use distributed tracing data to build runtime call graph
// Export traces to Jaeger/Zipkin, then query for service relationships

The runtime graph often reveals surprises. On one project, tracing showed that our "notification service" was actually the most-called service in the system because 14 other services sent events through it. The static analysis only showed 3 direct imports.
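As a rough illustration of how exported trace data might be reduced to service-level edges, the sketch below counts parent-to-child span relationships that cross a service boundary. The `SpanRecord` shape is a simplified, hypothetical stand-in; real Jaeger or Zipkin exports carry more fields and would need to be mapped into it first. The service names in the sample data are made up.

```typescript
// Hypothetical, simplified span record; not the Jaeger or Zipkin export format.
interface SpanRecord {
  spanId: string;
  parentSpanId?: string;
  service: string;
}

// Count caller -> callee edges between distinct services
function serviceCallCounts(spans: SpanRecord[]): Map<string, number> {
  const byId = new Map<string, SpanRecord>();
  for (const s of spans) byId.set(s.spanId, s);

  const counts = new Map<string, number>();
  for (const span of spans) {
    const parent = span.parentSpanId ? byId.get(span.parentSpanId) : undefined;
    // Skip root spans and calls that stay inside one service
    if (!parent || parent.service === span.service) continue;
    const key = `${parent.service} -> ${span.service}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Number of distinct calling services per callee
function fanIn(counts: Map<string, number>): Map<string, number> {
  const callers = new Map<string, Set<string>>();
  for (const key of counts.keys()) {
    const [caller, callee] = key.split(" -> ");
    if (!callers.has(callee)) callers.set(callee, new Set());
    callers.get(callee)!.add(caller);
  }
  return new Map([...callers].map(([svc, set]): [string, number] => [svc, set.size]));
}

const sampleSpans: SpanRecord[] = [
  { spanId: "s1", service: "gateway" },
  { spanId: "s2", parentSpanId: "s1", service: "payments" },
  { spanId: "s3", parentSpanId: "s2", service: "notifications" },
  { spanId: "s4", parentSpanId: "s1", service: "billing" },
  { spanId: "s5", parentSpanId: "s4", service: "notifications" },
  { spanId: "s6", parentSpanId: "s5", service: "notifications" }, // internal call, ignored
];
const callCounts = serviceCallCounts(sampleSpans);
const callerCounts = fanIn(callCounts);
```

Sorting services by fan-in is a quick way to surface hubs like the notification service described above.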
Step 3: Layer in Human Context (Weeks 3-4)
The most valuable part of the knowledge graph is the human layer. This comes from:
Git history analysis:
# Extract ownership information: commit counts per author and file (roll up by directory for modules)
git log --format='%ae' --name-only -- src/ | \
  awk '/^$/{next} /@/{author=$0; next} {print author, $0}' | \
  sort | uniq -c | sort -rn > ownership-data.txt

# Extract change coupling: pairs of files that change in the same commit
# (--format='---' prints a fixed separator per commit, so a path that happens to start
# with a hex character is never mistaken for a commit line)
git log --format='---' --name-only --since="6 months ago" | \
  awk '/^---$/ {n=0; next} NF {for (i=1; i<=n; i++) print files[i], $0; files[++n]=$0}' | \
  sort | uniq -c | sort -rn | head -20

Change coupling is incredibly revealing. Files that frequently change together have an implicit relationship that static analysis can't see. If payment-processor.ts and email-templates.ts change together in 80% of commits, there's a hidden dependency your architecture diagram doesn't show.
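Raw pair counts favor files that change often, so it can help to normalize them into a ratio like the 80% figure above. The sketch below is one possible definition, not the only one: it assumes each commit has already been parsed into a list of file paths, and it defines the ratio as shared commits divided by the commit count of the less frequently changed file. The type names and the `minTogether` threshold are illustrative.

```typescript
// Each commit is the list of file paths it touched (e.g. parsed from `git log --name-only`).
type Commit = string[];

interface Coupling {
  a: string;
  b: string;
  together: number; // commits touching both files
  ratio: number;    // together / commits touching the less frequently changed file
}

function changeCoupling(commits: Commit[], minTogether = 2): Coupling[] {
  const changes = new Map<string, number>();
  const pairs = new Map<string, number>();

  for (const commit of commits) {
    // De-duplicate and sort so each pair always gets the same key
    const files = [...new Set(commit)].sort();
    for (const f of files) changes.set(f, (changes.get(f) ?? 0) + 1);
    for (let i = 0; i < files.length; i++) {
      for (let j = i + 1; j < files.length; j++) {
        const key = `${files[i]}\n${files[j]}`; // newline used as a separator unlikely to occur in a path
        pairs.set(key, (pairs.get(key) ?? 0) + 1);
      }
    }
  }

  const result: Coupling[] = [];
  for (const [key, together] of pairs) {
    if (together < minTogether) continue; // ignore one-off coincidences
    const [a, b] = key.split("\n");
    const ratio = together / Math.min(changes.get(a)!, changes.get(b)!);
    result.push({ a, b, together, ratio });
  }
  return result.sort((x, y) => y.ratio - x.ratio || y.together - x.together);
}

// Both files change 5 times each, 4 of those in the same commit
const sampleCommits: Commit[] = [
  ["payment-processor.ts", "email-templates.ts"],
  ["payment-processor.ts", "email-templates.ts"],
  ["payment-processor.ts", "email-templates.ts"],
  ["payment-processor.ts", "email-templates.ts"],
  ["payment-processor.ts"],
  ["email-templates.ts"],
];
const couplings = changeCoupling(sampleCommits);
```

Pairs with a high ratio and no import edge between them are the candidates for hidden dependencies.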
ADR references:
Link Architecture Decision Records to the modules they affect. This turns "why is this code like this?" from a Slack question into a graph query.
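A minimal sketch of what that link could look like in code, assuming each ADR carries a list of the module ids it affects. The `Adr` shape, the `affects` field, and the sample records are hypothetical; in practice the list might come from ADR front matter or a hand-maintained index.

```typescript
// Hypothetical ADR record for illustration.
interface Adr {
  id: string;
  title: string;
  affects: string[]; // module ids this decision applies to
}

// Build a module -> ADRs index so lookups are a single map access
function indexAdrsByModule(adrs: Adr[]): Map<string, Adr[]> {
  const index = new Map<string, Adr[]>();
  for (const adr of adrs) {
    for (const moduleId of adr.affects) {
      const list = index.get(moduleId) ?? [];
      list.push(adr);
      index.set(moduleId, list);
    }
  }
  return index;
}

// "Why is this code like this?" as a query
function decisionsFor(index: Map<string, Adr[]>, moduleId: string): string[] {
  return (index.get(moduleId) ?? []).map((adr) => `${adr.id}: ${adr.title}`);
}

// Made-up sample records
const adrIndex = indexAdrsByModule([
  { id: "ADR-1", title: "Route all outbound email through the notification service", affects: ["notifications", "payments"] },
  { id: "ADR-2", title: "Keep ledger writes idempotent", affects: ["ledger"] },
]);
const paymentDecisions = decisionsFor(adrIndex, "payments");
```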
Step 4: Store and Query the Graph (Week 4)
For small-to-medium codebases, a JSON file with graph algorithms is sufficient:
// Simple graph queries without a graph database
class CodebaseGraph {
  private adjacency: Map<string, Set<string>> = new Map();
  private nodeData: Map<string, CodebaseNode> = new Map();

  addNode(node: CodebaseNode): void {
    this.nodeData.set(node.id, node);
    if (!this.adjacency.has(node.id)) {
      this.adjacency.set(node.id, new Set());
    }
  }

  addEdge(source: string, target: string): void {
    // Create the adjacency entry on demand so edges are never silently dropped
    if (!this.adjacency.has(source)) {
      this.adjacency.set(source, new Set());
    }
    this.adjacency.get(source)!.add(target);
  }

  hasEdge(source: string, target: string): boolean {
    return this.adjacency.get(source)?.has(target) ?? false;
  }

  // Assumes tags are stored in metadata.tags as a string array, e.g. ["layer:ui"]
  getNodesByTag(tag: string): CodebaseNode[] {
    return [...this.nodeData.values()].filter((node) => {
      const tags = node.metadata.tags;
      return Array.isArray(tags) && tags.includes(tag);
    });
  }

  // "What depends on this module?"
  dependents(nodeId: string): string[] {
    const results: string[] = [];
    for (const [source, targets] of this.adjacency) {
      if (targets.has(nodeId)) results.push(source);
    }
    return results;
  }

  // "What's the blast radius of changing this file?"
  // Breadth-first walk over reverse edges; the result includes nodeId itself
  blastRadius(nodeId: string, depth: number = 2): Set<string> {
    const visited = new Set<string>();
    const queue: [string, number][] = [[nodeId, 0]];
    while (queue.length > 0) {
      const [current, currentDepth] = queue.shift()!;
      if (currentDepth > depth || visited.has(current)) continue;
      visited.add(current);
      for (const dep of this.dependents(current)) {
        queue.push([dep, currentDepth + 1]);
      }
    }
    return visited;
  }

  // "Who should review changes to this module?"
  suggestReviewers(moduleId: string): string[] {
    // Assumes ownership is stored as an edge from the engineer node to the module,
    // so owners show up among a node's dependents
    const affected = this.blastRadius(moduleId, 1);
    const reviewers = new Set<string>();
    for (const id of affected) {
      for (const ownerId of this.dependents(id)) {
        const node = this.nodeData.get(ownerId);
        if (node?.type === 'engineer') reviewers.add(node.name);
      }
    }
    return [...reviewers];
  }
}

For larger codebases (100K+ lines, 50+ engineers), consider a proper graph database like Neo4j:
// Neo4j queries for codebase knowledge

// Find the blast radius of changing a module
MATCH path = (m:Module {name: "payments"})<-[:DEPENDS_ON*1..3]-(dependent)
RETURN dependent.name, length(path) as distance
ORDER BY distance;

// Find knowledge silos: modules with only one contributor
MATCH (e:Engineer)-[:CONTRIBUTES_TO]->(m:Module)
WITH m, collect(e.name) as contributors
WHERE size(contributors) = 1
RETURN m.name, contributors[0] as soleContributor;

// Find hidden coupling: modules that always change together
MATCH (m1:Module)-[c:CHANGES_WITH]->(m2:Module)
WHERE c.frequency > 0.7 AND NOT (m1)-[:DEPENDS_ON]->(m2)
RETURN m1.name, m2.name, c.frequency as couplingStrength
ORDER BY couplingStrength DESC;

Practical Applications
Application 1: Impact Analysis Before Changes
Before making a change, query the graph for blast radius:
"If I change the User model, what endpoints, services, and tests are affected?"
On one team, this reduced production incidents from cross-cutting changes by 60% in the first quarter. Engineers could see the full impact before writing code, not after deploying it.
Application 2: Intelligent Code Review Assignment
Instead of round-robin review assignment, query the graph for engineers who own the affected modules and their dependencies. PRs get reviewed by the people who actually understand the code.
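One possible shape for that query, sketched under a few assumptions: ownership is available as commit counts per engineer and file (for example, parsed from git history), the caller passes in the changed files plus whatever dependents it wants covered, and the PR author is excluded. The `OwnershipRow` type, the scoring rule, and the sample names are all illustrative.

```typescript
// Hypothetical ownership row: how many commits an engineer has made to a file.
interface OwnershipRow {
  engineer: string;
  file: string;
  commits: number;
}

// Rank candidate reviewers by total commits to the affected files
function rankReviewers(
  affectedFiles: string[],
  ownership: OwnershipRow[],
  author: string,
  limit = 2,
): string[] {
  const touched = new Set(affectedFiles);
  const scores = new Map<string, number>();
  for (const row of ownership) {
    if (!touched.has(row.file) || row.engineer === author) continue;
    scores.set(row.engineer, (scores.get(row.engineer) ?? 0) + row.commits);
  }
  return [...scores]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // highest score first, name as tiebreak
    .slice(0, limit)
    .map(([engineer]) => engineer);
}

// Made-up sample data
const reviewers = rankReviewers(
  ["src/payments/charge.ts", "src/billing/ledger.ts"],
  [
    { engineer: "jane", file: "src/payments/charge.ts", commits: 40 },
    { engineer: "omar", file: "src/billing/ledger.ts", commits: 25 },
    { engineer: "li", file: "src/payments/charge.ts", commits: 5 },
    { engineer: "li", file: "src/billing/ledger.ts", commits: 30 },
    { engineer: "sam", file: "src/ui/button.tsx", commits: 90 },
  ],
  "jane",
);
```

Commit count is a crude proxy for understanding; weighting recent commits more heavily is a common refinement.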
Application 3: Onboarding Acceleration
New engineers get a personalized view of the codebase starting from their team's modules, expanding outward along dependency edges. Instead of "here's the entire codebase, good luck," they get "here are the 8 modules your team owns, here's how they connect to 5 other modules, and here's who to ask about each one."
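The "expanding outward" view can be approximated with a breadth-first walk that groups modules by their distance from the team's own code. This is a sketch assuming the dependency graph is available as a map from each module to the modules it depends on; the function name and sample module names are hypothetical.

```typescript
// dependencies: module -> modules it depends on
// Returns rings[0] = the team's modules, rings[1] = their direct dependencies, and so on
function onboardingRings(
  dependencies: Map<string, string[]>,
  teamModules: string[],
  maxDistance = 2,
): string[][] {
  const rings: string[][] = [];
  const seen = new Set(teamModules);
  let frontier = [...seen];

  for (let distance = 0; distance <= maxDistance && frontier.length > 0; distance++) {
    rings.push(frontier);
    const next: string[] = [];
    for (const mod of frontier) {
      for (const dep of dependencies.get(mod) ?? []) {
        if (!seen.has(dep)) {
          seen.add(dep); // each module appears only in its nearest ring
          next.push(dep);
        }
      }
    }
    frontier = next;
  }
  return rings;
}

// Made-up sample graph
const sampleDeps = new Map<string, string[]>([
  ["checkout", ["payments", "cart"]],
  ["payments", ["ledger", "stripe-client"]],
  ["cart", ["catalog"]],
  ["ledger", ["db"]],
]);
const rings = onboardingRings(sampleDeps, ["checkout"]);
```

Each ring is a natural reading order: start with ring 0, and attach an owner to every module in the outer rings so the new engineer knows who to ask.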
Application 4: Architecture Fitness Functions
Use the graph to enforce architectural rules:
// Fitness function: UI modules should never directly import database modules
// Assumes CodebaseGraph exposes getNodesByTag(tag) and hasEdge(source, target) helpers
function checkLayerViolations(graph: CodebaseGraph): string[] {
  const violations: string[] = [];
  const uiModules = graph.getNodesByTag("layer:ui");
  const dbModules = graph.getNodesByTag("layer:database");
  for (const ui of uiModules) {
    for (const db of dbModules) {
      if (graph.hasEdge(ui.id, db.id)) {
        violations.push(`Layer violation: ${ui.name} directly imports ${db.name}`);
      }
    }
  }
  return violations;
}
// Run this in CI. Fail the build on violations.

The Stealable Framework: The GRAPH Playbook
Here's the 4-week plan to build your first codebase knowledge graph:
Week 1 - Generate: Run madge or a custom AST parser to extract the static dependency graph. Export as JSON. Visualize it. Share with the team. Just seeing the graph often triggers "I didn't know those were connected" moments.
Week 2 - Enrich: Add git ownership data and change coupling analysis. Identify knowledge monopolies and hidden dependencies.
Week 3 - Query: Build the 5 most useful queries: blast radius, ownership lookup, dependency depth, change coupling, and circular dependency detection.
Week 4 - Integrate: Add the most impactful query to your development workflow. I recommend starting with blast radius analysis in CI: when a PR touches a file, automatically comment with the list of potentially affected modules.
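Of the five queries listed for Week 3, circular dependency detection is the one not sketched earlier, so here is a minimal version. It assumes the import graph is a map from each module to the modules it imports, and it uses a depth-first search that reports one cycle per back edge rather than enumerating every possible cycle. It is recursive, so a very deep graph would need an iterative rewrite.

```typescript
// graph: module -> modules it imports
function findCycles(graph: Map<string, string[]>): string[][] {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];
  const cycles: string[][] = [];

  function visit(node: string): void {
    state.set(node, "visiting");
    stack.push(node);
    for (const next of graph.get(node) ?? []) {
      const nextState = state.get(next);
      if (nextState === "visiting") {
        // Back edge: the cycle is the stack slice from `next` to the current node
        cycles.push([...stack.slice(stack.indexOf(next)), next]);
      } else if (nextState === undefined) {
        visit(next);
      }
    }
    stack.pop();
    state.set(node, "done");
  }

  for (const node of graph.keys()) {
    if (!state.has(node)) visit(node);
  }
  return cycles;
}

// Made-up sample graph with one cycle: a -> b -> c -> a
const sampleImports = new Map<string, string[]>([
  ["a", ["b"]],
  ["b", ["c"]],
  ["c", ["a", "d"]],
  ["d", []],
]);
const cycles = findCycles(sampleImports);
```

Like the layer check, this is cheap enough to run on every PR and fail the build when a new cycle appears.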
The knowledge graph doesn't need to be perfect. A rough graph that covers 80% of your codebase is far more useful than no graph at all. Start with static analysis, add runtime and human data over time, and let the graph grow as your understanding grows.
The codebase you're responsible for is a living system with thousands of interconnected parts. You wouldn't run a city without a map. Don't run a codebase without a knowledge graph.