Building a Knowledge Graph of Your Codebase

Vaibhav Verma

May 26, 2026

Two years ago, a new engineer on my team asked me how our payment system worked. I started drawing on a whiteboard: "The API gateway calls the payment service, which talks to Stripe, and then events go to the billing service, which updates the ledger, and..." I was 10 minutes into the explanation and had filled two whiteboards. The engineer looked more confused, not less.

That's when I realized the problem. Our codebase had grown to a point where no single person could hold the full picture in their head. The architecture existed in fragments across dozens of engineers' mental models, outdated wiki pages, and tribal knowledge shared over lunch. We needed an externalized, queryable representation of how our system actually worked. We needed a knowledge graph.

What a Codebase Knowledge Graph Actually Is

A codebase knowledge graph is a structured representation of the entities in your codebase and the relationships between them. Think of it as the data model behind the whiteboard drawing, except it's generated from actual code rather than someone's memory.

Entities include:

Modules/packages: The organizational units of your code
Functions/classes: The building blocks
APIs/endpoints: The boundaries
Data models: The schemas and types
Dependencies: Both internal and external
Tests: And what they cover
Engineers: And what they own

Relationships include:

"Module A imports Module B"
"Function X calls Function Y"
"Endpoint /api/payments depends on Service Z"
"Engineer Jane is the primary contributor to Module A"
"Data model User is referenced by 47 functions across 12 modules"

typescript

// A simplified knowledge graph data structure
interface CodebaseNode {
  id: string;
  type: 'module' | 'function' | 'class' | 'endpoint' | 'model' | 'dependency' | 'test' | 'engineer';
  name: string;
  metadata: Record<string, unknown>;
}

interface CodebaseEdge {
  source: string;
  target: string;
  relationship: 'imports' | 'calls' | 'depends_on' | 'owns' | 'tests' | 'references';
  weight: number;  // frequency or importance
}

interface CodebaseKnowledgeGraph {
  nodes: CodebaseNode[];
  edges: CodebaseEdge[];
}
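To make the shape concrete, here's a small hand-built instance covering a few of the relationships listed above. Treat it as a sketch: the ids (`mod:payments`, `eng:jane`, and so on) and the `incomingEdges` helper are hypothetical names of mine, and the interfaces are trimmed copies so the snippet stands alone.

```typescript
// Trimmed copies of the interfaces above so the sketch is self-contained.
interface CodebaseNode {
  id: string;
  type: 'module' | 'function' | 'class' | 'endpoint' | 'model' | 'engineer';
  name: string;
  metadata: Record<string, unknown>;
}

interface CodebaseEdge {
  source: string;
  target: string;
  relationship: 'imports' | 'calls' | 'depends_on' | 'owns' | 'tests' | 'references';
  weight: number;
}

interface CodebaseKnowledgeGraph {
  nodes: CodebaseNode[];
  edges: CodebaseEdge[];
}

// Hypothetical ids; a real extractor would derive them from file paths or symbols.
const exampleGraph: CodebaseKnowledgeGraph = {
  nodes: [
    { id: 'mod:payments', type: 'module', name: 'payments', metadata: {} },
    { id: 'mod:billing', type: 'module', name: 'billing', metadata: {} },
    { id: 'model:User', type: 'model', name: 'User', metadata: {} },
    { id: 'eng:jane', type: 'engineer', name: 'Jane', metadata: {} },
  ],
  edges: [
    { source: 'mod:billing', target: 'mod:payments', relationship: 'imports', weight: 1 },
    { source: 'mod:payments', target: 'model:User', relationship: 'references', weight: 3 },
    { source: 'mod:billing', target: 'model:User', relationship: 'references', weight: 1 },
    { source: 'eng:jane', target: 'mod:payments', relationship: 'owns', weight: 1 },
  ],
};

// Answers questions like "what references the User model?" by filtering edges.
function incomingEdges(
  graph: CodebaseKnowledgeGraph,
  targetId: string,
  relationship: CodebaseEdge['relationship'],
): CodebaseEdge[] {
  return graph.edges.filter(
    (edge) => edge.target === targetId && edge.relationship === relationship,
  );
}
```

Even at this size, "Data model User is referenced by N things" is just `incomingEdges(exampleGraph, 'model:User', 'references').length`.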

The Contrarian Take: Static Analysis Alone Won't Cut It

Most attempts at building codebase knowledge rely on static analysis: parsing import statements, building ASTs, and tracing type references. That's necessary but insufficient. Static analysis tells you what the code could do. It doesn't tell you what the code actually does in production.

I've seen dependency graphs that show Module A imports Module B, but the actual call path in production goes A -> C -> D -> B because of runtime configuration and dependency injection. The static graph is technically correct and practically useless.

A real knowledge graph combines three data sources:

Static analysis: What the code says (imports, types, function signatures)
Runtime data: What actually happens (call traces, API logs, database queries)
Human context: Why it's this way (ADRs, commit messages, ownership data)
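One way to combine the three sources is to tag every edge with where it came from and merge edges that share endpoints. The sketch below assumes each extractor emits simple `from`/`to` pairs; `SourcedEdge`, `mergeEdges`, and `runtimeOnly` are illustrative names, not part of any tool.

```typescript
type EdgeSource = 'static' | 'runtime' | 'human';

interface SourcedEdge {
  from: string;
  to: string;
  source: EdgeSource;
}

interface MergedEdge {
  from: string;
  to: string;
  sources: EdgeSource[]; // every data source that reported this edge
}

// Collapses duplicate edges and records which sources observed each one.
function mergeEdges(edges: SourcedEdge[]): MergedEdge[] {
  const merged = new Map<string, MergedEdge>();
  for (const edge of edges) {
    const key = JSON.stringify([edge.from, edge.to]);
    const existing = merged.get(key);
    if (existing === undefined) {
      merged.set(key, { from: edge.from, to: edge.to, sources: [edge.source] });
    } else if (!existing.sources.includes(edge.source)) {
      existing.sources.push(edge.source);
    }
  }
  return Array.from(merged.values());
}

// Edges seen in production that static analysis never reported.
function runtimeOnly(edges: MergedEdge[]): MergedEdge[] {
  return edges.filter(
    (edge) => edge.sources.includes('runtime') && !edge.sources.includes('static'),
  );
}

// The A -> C -> D -> B situation described earlier, plus one human annotation.
const observedEdges: SourcedEdge[] = [
  { from: 'A', to: 'B', source: 'static' },
  { from: 'A', to: 'C', source: 'runtime' },
  { from: 'C', to: 'D', source: 'runtime' },
  { from: 'D', to: 'B', source: 'runtime' },
  { from: 'A', to: 'C', source: 'human' },
];

const mergedEdges = mergeEdges(observedEdges);
```

Here `runtimeOnly(mergedEdges)` surfaces the three hops that the static import graph misses, which is usually where the surprises are.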

Building Your Knowledge Graph: Step by Step

Step 1: Extract Static Relationships (Week 1)

Start with what's easiest to extract: the import graph and call graph from static analysis.

typescript

// Using TypeScript compiler API to extract imports
import * as ts from "typescript";

function extractImports(sourceFile: ts.SourceFile): string[] {
  const imports: string[] = [];
  ts.forEachChild(sourceFile, (node) => {
    if (ts.isImportDeclaration(node)) {
      const moduleSpecifier = node.moduleSpecifier;
      if (ts.isStringLiteral(moduleSpecifier)) {
        imports.push(moduleSpecifier.text);
      }
    }
  });
  return imports;
}

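// Build the import graph for your entire codebase
function buildImportGraph(projectPath: string): Map<string, string[]> {
  // createProgram expects root file names rather than a directory,
  // so list the sources under projectPath (e.g. "src/") first
  const rootFiles = ts.sys.readDirectory(projectPath, [".ts", ".tsx"]);
  const program = ts.createProgram(rootFiles, {});
  const graph = new Map<string, string[]>();

  for (const sourceFile of program.getSourceFiles()) {
    // Skip .d.ts files and anything resolved from node_modules
    if (!sourceFile.isDeclarationFile && !program.isSourceFileFromExternalLibrary(sourceFile)) {
      graph.set(sourceFile.fileName, extractImports(sourceFile));
    }
  }
  return graph;
}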

For a quicker start, use existing tools:

bash

# Generate a dependency graph with madge
npx madge --json src/ > dependency-graph.json

# Generate a visual graph
npx madge --image graph.svg src/

# Find circular dependencies (these are high-priority graph edges)
npx madge --circular src/
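The JSON that `madge --json` writes is an object mapping each file to the files it imports, so turning it into graph edges takes only a few lines. Here's a sketch; `madgeToEdges` and `fanIn` are hypothetical helper names, and the sample input is made up.

```typescript
// Shape of `madge --json` output: each file maps to the files it imports.
type MadgeOutput = Record<string, string[]>;

interface ImportEdge {
  source: string;
  target: string;
  relationship: 'imports';
  weight: number;
}

// One 'imports' edge per (file, dependency) pair.
function madgeToEdges(deps: MadgeOutput): ImportEdge[] {
  const edges: ImportEdge[] = [];
  for (const [file, imports] of Object.entries(deps)) {
    for (const imported of imports) {
      edges.push({ source: file, target: imported, relationship: 'imports', weight: 1 });
    }
  }
  return edges;
}

// Files ranked by how many other files import them (fan-in), highest first.
function fanIn(deps: MadgeOutput): [string, number][] {
  const counts = new Map<string, number>();
  for (const imports of Object.values(deps)) {
    for (const imported of imports) {
      counts.set(imported, (counts.get(imported) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

In practice you would parse `dependency-graph.json` with `JSON.parse` and pass the result in. The fan-in ranking is a cheap first look at which files everything else leans on.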

Step 2: Add Runtime Relationships (Weeks 2-3)

Static analysis misses runtime behavior. Add OpenTelemetry tracing to capture actual call paths:

typescript

// Instrument key service boundaries to capture runtime relationships
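import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("knowledge-graph-builder");

export function instrumentedHandler<A extends unknown[], R>(
  name: string,
  handler: (...args: A) => Promise<R>,
) {
  return (...args: A) =>
    // startActiveSpan makes this span the parent of spans created inside the
    // handler, which is what lets the trace record who called whom
    tracer.startActiveSpan(name, async (span) => {
      try {
        const result = await handler(...args);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.recordException(error instanceof Error ? error : String(error));
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    });
}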

// Use distributed tracing data to build runtime call graph
// Export traces to Jaeger/Zipkin, then query for service relationships

The runtime graph often reveals surprises. On one project, tracing showed that our "notification service" was actually the most-called service in the system because 14 other services sent events through it. The static analysis only showed 3 direct imports.
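Once traces are exported, the service-level call graph falls out of the parent/child links between spans. The sketch below assumes a deliberately simplified `SpanRecord` shape of my own; real Jaeger, Zipkin, or OTLP exports carry more fields and need a small mapping step first.

```typescript
// Simplified span record (an assumption, not an exporter schema).
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  service: string;
}

// Counts cross-service calls as "caller -> callee" by following parent links.
function serviceCallCounts(spans: SpanRecord[]): Map<string, number> {
  const serviceBySpan = new Map<string, string>();
  for (const span of spans) {
    serviceBySpan.set(`${span.traceId}:${span.spanId}`, span.service);
  }

  const counts = new Map<string, number>();
  for (const span of spans) {
    if (span.parentSpanId === undefined) continue; // root span, no caller
    const caller = serviceBySpan.get(`${span.traceId}:${span.parentSpanId}`);
    // Skip spans whose parent was not exported, and calls within one service.
    if (caller === undefined || caller === span.service) continue;
    const key = `${caller} -> ${span.service}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Summing the counts by callee is how a finding like "the notification service is the most-called service" would show up in the data.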

Step 3: Layer in Human Context (Weeks 3-4)

The most valuable part of the knowledge graph is the human layer. This comes from:

Git history analysis:

bash

# Extract ownership information: commit counts per author per file (roll up by directory for modules)
git log --format='%ae' --name-only -- src/ | \
  awk '/^$/{next} /@/{author=$0; next} {print author, $0}' | \
  sort | uniq -c | sort -rn > ownership-data.txt

# Extract change coupling: files that change together
# A literal --- marks each commit, so file paths are never mistaken for commit hashes
git log --format='---' --name-only --since="6 months ago" | \
  awk '/^---$/ { n = 0; next }
       NF { for (i = 1; i <= n; i++) print files[i] "\t" $0; files[++n] = $0 }' | \
  sort | uniq -c | sort -rn | head -20

Change coupling is incredibly revealing. Files that frequently change together have an implicit relationship that static analysis can't see. If payment-processor.ts and email-templates.ts change together in 80% of commits, there's a hidden dependency your architecture diagram doesn't show.
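A figure like "80% of commits" is a ratio: of the commits that touch one file, how many also touch the other. Here's a minimal sketch of that calculation, assuming you've already parsed git history into one file list per commit; `Commit` and `couplingRatio` are names I've made up for illustration.

```typescript
// Each inner array is the list of files touched by one commit.
type Commit = string[];

// Fraction of the commits touching `file` that also touch `other`.
function couplingRatio(commits: Commit[], file: string, other: string): number {
  const touchingFile = commits.filter((files) => files.includes(file));
  if (touchingFile.length === 0) return 0;
  const touchingBoth = touchingFile.filter((files) => files.includes(other));
  return touchingBoth.length / touchingFile.length;
}

// Illustrative history: five commits touch the processor, four also touch the templates.
const history: Commit[] = [
  ['payment-processor.ts', 'email-templates.ts'],
  ['payment-processor.ts', 'email-templates.ts'],
  ['payment-processor.ts', 'email-templates.ts', 'ledger.ts'],
  ['payment-processor.ts', 'email-templates.ts'],
  ['payment-processor.ts'],
  ['email-templates.ts', 'README.md'],
];
```

Note that the ratio is directional: here `couplingRatio(history, 'payment-processor.ts', 'email-templates.ts')` differs from the reverse, because the templates also change on their own. Which direction you store as the edge weight is a design choice.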

ADR references:

Link Architecture Decision Records to the modules they affect. This turns "why is this code like this?" from a Slack question into a graph query.
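A rough sketch of what that link could look like, assuming each ADR carries a list of the module ids it affects. The `AdrRecord` shape and both function names are hypothetical; how you record the link (front matter, a sidecar file, a naming convention) is up to you.

```typescript
// Hypothetical ADR metadata; assumes each record lists the module ids it affects.
interface AdrRecord {
  id: string;
  title: string;
  affects: string[];
}

// Builds a module id -> ADRs lookup so "why" questions become a single get().
function indexAdrsByModule(adrs: AdrRecord[]): Map<string, AdrRecord[]> {
  const index = new Map<string, AdrRecord[]>();
  for (const adr of adrs) {
    for (const moduleId of adr.affects) {
      const existing = index.get(moduleId) ?? [];
      existing.push(adr);
      index.set(moduleId, existing);
    }
  }
  return index;
}

// "Why is this code like this?" as a graph query.
function decisionsFor(index: Map<string, AdrRecord[]>, moduleId: string): string[] {
  return (index.get(moduleId) ?? []).map((adr) => `${adr.id}: ${adr.title}`);
}
```

Stored this way, each ADR is just another node with edges to modules, so it can sit in the same graph as imports and ownership.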

Step 4: Store and Query the Graph (Week 4)

For small-to-medium codebases, a JSON file with graph algorithms is sufficient:

typescript

// Simple graph queries without a graph database
class CodebaseGraph {
  private adjacency: Map<string, Set<string>> = new Map();
  private nodeData: Map<string, CodebaseNode> = new Map();

  addNode(node: CodebaseNode): void {
    this.nodeData.set(node.id, node);
    if (!this.adjacency.has(node.id)) {
      this.adjacency.set(node.id, new Set());
    }
  }

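  addEdge(source: string, target: string): void {
    // Create the adjacency entry on demand so an edge from a node that was
    // never registered with addNode is not silently dropped
    if (!this.adjacency.has(source)) this.adjacency.set(source, new Set());
    this.adjacency.get(source)!.add(target);
  }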

  // "What depends on this module?"
  dependents(nodeId: string): string[] {
    const results: string[] = [];
    for (const [source, targets] of this.adjacency) {
      if (targets.has(nodeId)) results.push(source);
    }
    return results;
  }

  // "What's the blast radius of changing this file?"
  blastRadius(nodeId: string, depth: number = 2): Set<string> {
    const visited = new Set<string>();
    const queue: [string, number][] = [[nodeId, 0]];

    while (queue.length > 0) {
      const [current, currentDepth] = queue.shift()!;
      if (currentDepth > depth || visited.has(current)) continue;
      visited.add(current);

      const dependentNodes = this.dependents(current);
      for (const dep of dependentNodes) {
        queue.push([dep, currentDepth + 1]);
      }
    }
    return visited;
  }

  // "Who should review changes to this module?"
  suggestReviewers(moduleId: string): string[] {
    // Assumes ownership is stored as engineer -> module edges, so owners show
    // up as dependents. Depth 2 reaches the module's own owners (depth 1) and
    // the owners of its direct dependents (depth 2).
    const affected = this.blastRadius(moduleId, 2);
    return [...affected]
      .map(id => this.nodeData.get(id))
      .filter(node => node?.type === 'engineer')
      .map(node => node!.name);
  }
}
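The `dependents` method above scans every adjacency entry on each call, which is fine for small graphs but adds up during a breadth-first search. One alternative, sketched here under my own naming (`buildReverseIndex`, `blastRadiusWithDistance`), is to precompute a target-to-sources index and return the distance of each affected node along with it.

```typescript
interface DependencyEdge {
  source: string; // the node that depends on target
  target: string;
}

// Builds target -> sources so "what depends on X?" is a single lookup.
function buildReverseIndex(edges: DependencyEdge[]): Map<string, Set<string>> {
  const reverse = new Map<string, Set<string>>();
  for (const edge of edges) {
    const sources = reverse.get(edge.target) ?? new Set<string>();
    sources.add(edge.source);
    reverse.set(edge.target, sources);
  }
  return reverse;
}

// Breadth-first search over dependents; returns node -> number of hops from start.
function blastRadiusWithDistance(
  reverse: Map<string, Set<string>>,
  start: string,
  maxDepth: number = 2,
): Map<string, number> {
  const distance = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  for (let head = 0; head < queue.length; head++) {
    const current = queue[head]!;
    const depth = distance.get(current) ?? 0;
    if (depth >= maxDepth) continue; // do not expand past the depth limit
    for (const dependent of reverse.get(current) ?? []) {
      if (!distance.has(dependent)) {
        distance.set(dependent, depth + 1);
        queue.push(dependent);
      }
    }
  }
  return distance;
}
```

Returning distances rather than a flat set makes it easy to sort the results so the closest dependents are reviewed first.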

For larger codebases (100K+ lines, 50+ engineers), consider a proper graph database like Neo4j:

cypher

// Neo4j queries for codebase knowledge

// Find the blast radius of changing a module
MATCH path = (m:Module {name: "payments"})<-[:DEPENDS_ON*1..3]-(dependent)
// A dependent reachable by several paths is reported once, at its shortest distance
RETURN dependent.name, min(length(path)) AS distance
ORDER BY distance;

// Find knowledge silos: modules with only one contributor
MATCH (e:Engineer)-[:CONTRIBUTES_TO]->(m:Module)
WITH m, collect(e.name) as contributors
WHERE size(contributors) = 1
RETURN m.name, contributors[0] as soleContributor;

// Find hidden coupling: modules that always change together
MATCH (m1:Module)-[c:CHANGES_WITH]->(m2:Module)
WHERE c.frequency > 0.7 AND NOT (m1)-[:DEPENDS_ON]-(m2)
RETURN m1.name, m2.name, c.frequency as couplingStrength
ORDER BY couplingStrength DESC;

Practical Applications

Application 1: Impact Analysis Before Changes

Before making a change, query the graph for blast radius:

"If I change the User model, what endpoints, services, and tests are affected?"

On one team, this reduced production incidents from cross-cutting changes by 60% in the first quarter. Engineers could see the full impact before writing code, not after deploying it.
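To answer a question phrased like the one above, the raw blast radius usually needs grouping by node type. Here's a small sketch, assuming the blast radius query hands back node ids with their types; `AffectedNode` and `groupByType` are illustrative names.

```typescript
interface AffectedNode {
  id: string;
  type: string; // e.g. 'endpoint', 'service', 'test'
}

// Groups affected node ids by type so the result reads as
// "these endpoints, these services, these tests".
function groupByType(nodes: AffectedNode[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const node of nodes) {
    const ids = groups.get(node.type) ?? [];
    ids.push(node.id);
    groups.set(node.type, ids);
  }
  return groups;
}
```

The grouped view is what engineers tend to act on: the test group tells them what to run, and the endpoint group tells them what to watch after deploying.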

Application 2: Intelligent Code Review Assignment

Instead of round-robin review assignment, query the graph for engineers who own the affected modules and their dependencies. PRs get reviewed by the people who actually understand the code.
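Here's one way that query could rank candidates, assuming you have per-module commit counts from the git ownership data. The `Ownership` shape, the `rankReviewers` name, and the scoring rule (sum of commits across affected modules) are my assumptions; other weightings are equally defensible.

```typescript
// module id -> (engineer -> commit count), e.g. derived from git history.
type Ownership = Map<string, Map<string, number>>;

// Scores engineers by their total commits across the affected modules,
// skipping the PR author, and returns the top candidates.
function rankReviewers(
  affectedModules: string[],
  ownership: Ownership,
  prAuthor: string,
  limit: number = 2,
): string[] {
  const scores = new Map<string, number>();
  for (const moduleId of affectedModules) {
    const owners = ownership.get(moduleId);
    if (owners === undefined) continue;
    for (const [engineer, commits] of owners) {
      if (engineer === prAuthor) continue;
      scores.set(engineer, (scores.get(engineer) ?? 0) + commits);
    }
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([engineer]) => engineer);
}
```

Raw commit counts favor long-tenured contributors, so you may want to weight recent commits more heavily before trusting the ranking.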

Application 3: Onboarding Acceleration

New engineers get a personalized view of the codebase starting from their team's modules, expanding outward along dependency edges. Instead of "here's the entire codebase, good luck," they get "here are the 8 modules your team owns, here's how they connect to 5 other modules, and here's who to ask about each one."

Application 4: Architecture Fitness Functions

Use the graph to enforce architectural rules:

typescript

// Fitness function: UI modules should never directly import database modules
function checkLayerViolations(graph: CodebaseGraph): string[] {
  const violations: string[] = [];
  const uiModules = graph.getNodesByTag("layer:ui");
  const dbModules = graph.getNodesByTag("layer:database");

  for (const ui of uiModules) {
    for (const db of dbModules) {
      if (graph.hasEdge(ui.id, db.id)) {
        violations.push(`Layer violation: ${ui.name} directly imports ${db.name}`);
      }
    }
  }
  return violations;
}
// Run this in CI. Fail the build on violations.
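The fitness function calls `getNodesByTag` and `hasEdge`, which the `CodebaseGraph` class shown earlier doesn't define. Here's a minimal sketch of a graph that supports both, assuming nodes carry a `tags` array (for example `"layer:ui"`). `TaggedGraph` and `findLayerViolations` are hypothetical names.

```typescript
interface TaggedNode {
  id: string;
  name: string;
  tags: string[]; // e.g. ["layer:ui"]
}

class TaggedGraph {
  private nodes = new Map<string, TaggedNode>();
  private adjacency = new Map<string, Set<string>>();

  addNode(node: TaggedNode): void {
    this.nodes.set(node.id, node);
  }

  addEdge(source: string, target: string): void {
    const targets = this.adjacency.get(source) ?? new Set<string>();
    targets.add(target);
    this.adjacency.set(source, targets);
  }

  // All nodes carrying the given tag.
  getNodesByTag(tag: string): TaggedNode[] {
    return Array.from(this.nodes.values()).filter((node) => node.tags.includes(tag));
  }

  // True only for a direct edge from source to target.
  hasEdge(source: string, target: string): boolean {
    return this.adjacency.get(source)?.has(target) ?? false;
  }
}

// Generalized form of the layer check: no direct edge from fromTag to toTag.
function findLayerViolations(graph: TaggedGraph, fromTag: string, toTag: string): string[] {
  const violations: string[] = [];
  for (const from of graph.getNodesByTag(fromTag)) {
    for (const to of graph.getNodesByTag(toTag)) {
      if (graph.hasEdge(from.id, to.id)) {
        violations.push(`${from.name} directly imports ${to.name}`);
      }
    }
  }
  return violations;
}
```

Because `hasEdge` only looks at direct edges, a UI module that reaches the database through a service layer passes the check, which is the intent of the rule.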

The Stealable Framework: The GRAPH Playbook

Here's the 4-week plan to build your first codebase knowledge graph:

Week 1 - Generate: Run madge or a custom AST parser to extract the static dependency graph. Export as JSON. Visualize it. Share with the team. Just seeing the graph often triggers "I didn't know those were connected" moments.

Week 2 - Enrich: Add git ownership data and change coupling analysis. Identify knowledge monopolies and hidden dependencies.

Week 3 - Query: Build the 5 most useful queries: blast radius, ownership lookup, dependency depth, change coupling, and circular dependency detection.

Week 4 - Integrate: Add the most impactful query to your development workflow. I recommend starting with blast radius analysis in CI: when a PR touches a file, automatically comment with the list of potentially affected modules.
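Of the five queries listed for week 3, circular dependency detection is the one that needs its own algorithm. The sketch below uses a depth-first search that tracks which nodes are on the current path; it takes the import graph as a `Map` of file to imports, and `findCycle` is a name I've picked for illustration. It stops at the first cycle it finds rather than enumerating all of them.

```typescript
// Returns one cycle as a path that starts and ends on the same node, or null.
function findCycle(graph: Map<string, string[]>): string[] | null {
  const state = new Map<string, 'visiting' | 'done'>();
  const stack: string[] = []; // nodes on the current DFS path

  const visit = (node: string): string[] | null => {
    state.set(node, 'visiting');
    stack.push(node);
    for (const next of graph.get(node) ?? []) {
      const nextState = state.get(next);
      if (nextState === 'visiting') {
        // next is already on the current path, so the path from next back to next is a cycle
        return [...stack.slice(stack.indexOf(next)), next];
      }
      if (nextState === undefined) {
        const cycle = visit(next);
        if (cycle !== null) return cycle;
      }
    }
    stack.pop();
    state.set(node, 'done');
    return null;
  };

  for (const node of graph.keys()) {
    if (!state.has(node)) {
      const cycle = visit(node);
      if (cycle !== null) return cycle;
    }
  }
  return null;
}
```

The recursion depth equals the longest import chain, so a very deep graph may call for an explicit stack instead. For a CI check, a non-null result is enough to fail the build and print the offending path.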

The knowledge graph doesn't need to be perfect. A rough graph that covers 80% of your codebase is far more useful than no graph at all. Start with static analysis, add runtime and human data over time, and let the graph grow as your understanding grows.

The codebase you're responsible for is a living system with thousands of interconnected parts. You wouldn't run a city without a map. Don't run a codebase without a knowledge graph.