CodeWalker

Walk your codebase before writing new code.

CodeWalker is an MCP server that gives Claude Code real-time access to your Python codebase structure, enabling AI-assisted development that reuses existing code instead of duplicating it.

The Problem: AI Code Duplication

What Happens Without CodeWalker

When Claude Code writes code, it can't see what already exists in your codebase. This causes a cascade of problems:

Day 1: You ask Claude to add CSV loading functionality

# Claude creates: src/data_loader.py
def load_csv_file(path):
    return pd.read_csv(path)

Day 5: Different feature needs CSV loading

# Claude creates: src/importer.py (Claude has no memory of data_loader.py)
def load_csv_data(filepath):
    df = pd.read_csv(filepath)
    return df

Day 10: Another feature, another duplicate

# Claude creates: src/utils.py (Claude still doesn't know about the others)
def read_csv(file_path):
    return pd.read_csv(file_path, low_memory=False)  # Now with different behavior!

Result after 2 weeks:

🔴 7 different CSV loading functions across your codebase
🔴 Inconsistent behavior (one uses low_memory=False, others don't)
🔴 Impossible to maintain (bug fixes need to be applied 7 times)
🔴 Unpredictable behavior (which implementation gets called depends on imports)
🔴 Code review nightmare (reviewing duplicate implementations wastes time)

The Cost of Code Duplication

This isn't just messy - it's expensive:

Impact	Cost
Development Time	30-40% wasted rewriting existing code
Bug Fixes	Same bug appears in multiple places, fixed multiple times
Code Reviews	Reviewers waste time on duplicate implementations
Onboarding	New developers confused by inconsistent patterns
Technical Debt	Duplicates diverge over time, creating maintenance burden
Testing	Same logic tested multiple times (or worse, inconsistently)

Real Example: A codebase with 800 functions had 52.7% duplication rate - 422 functions were duplicates. That's thousands of wasted lines of code.

How CodeWalker Solves This

CodeWalker indexes your codebase and lets Claude search before writing:

With CodeWalker

Day 1: You ask Claude to add CSV loading

Claude (internal): Let me check if CSV loading already exists...
> search_functions("load csv")

Found: load_csv_file() in src/data_loader.py

Claude: "I found an existing CSV loader. Let me use it instead of creating a new one."

Result:

# Claude imports existing function
from src.data_loader import load_csv_file

data = load_csv_file(path)

Day 5, 10, 15...: Same pattern - Claude finds and reuses existing code

Result after 2 weeks:

✅ 1 canonical CSV loading function (not 7)
✅ Consistent behavior across entire codebase
✅ Easy to maintain (fix bugs once, fixed everywhere)
✅ Predictable behavior (one implementation = one behavior)
✅ Fast code reviews (reviewers see reuse, not duplication)

Why This Problem Exists

LLMs Lack Architectural Awareness

Claude Code (and all LLMs) have a fundamental limitation:

❌ Can't see your codebase structure ❌ Can't search across files ❌ Can't remember what exists ❌ Can't detect duplicates

The technical reason: When Claude writes code, it only sees:

The current file you're editing
Recent conversation context
Maybe a few related files you showed it

What Claude DOESN'T see:

That load_csv_file() already exists in src/data_loader.py
That 3 other files have similar functions
That your team has a canonical implementation
Your codebase architecture and patterns

Result: Claude invents new implementations instead of reusing existing ones.

The "10 Developers, 0 Communication" Problem

Working with AI without CodeWalker is like having 10 developers who never talk to each other:

Developer 1 (Monday):    Creates load_csv_file()
Developer 2 (Tuesday):   Doesn't know about it, creates load_csv_data()
Developer 3 (Wednesday): Doesn't know about either, creates read_csv()
Developer 4 (Thursday):  Creates import_csv()
... and so on

Each "developer" (AI session) works in isolation, creating duplicates because they can't see what others did.

CodeWalker fixes this by giving AI a "shared memory" of your entire codebase.

Real-World Impact

Case Study: Elisity Project

Before CodeWalker:

800 total functions
422 duplicates (52.7% duplication rate)
33 direct pd.read_csv() calls (should use centralized loader)
11 duplicate print_summary() implementations
3 duplicate load_flow_data() functions with diverging behavior

With CodeWalker:

Claude finds existing implementations before writing new code
Duplication rate drops to near-zero for new code
Codebase becomes more maintainable over time

Time Saved:

Development: 30-40% less time rewriting existing code
Code Review: Reviewers focus on new logic, not duplicate detection
Bug Fixes: Fix once instead of hunting down 3-7 duplicates

How It Works

Architecture

┌─────────────────────┐
│   Your Codebase     │
│  (Python files)     │
└──────────┬──────────┘
           │
           │ AST Parser extracts
           │ function metadata
           ▼
┌─────────────────────┐
│   SQLite Index      │
│  (functions.db)     │
│                     │
│  • Function names   │
│  • Parameters       │
│  • Locations        │
│  • Docstrings       │
└──────────┬──────────┘
           │
           │ Claude queries via
           │ MCP protocol
           ▼
┌─────────────────────┐
│   Claude Code       │
│                     │
│  "Does load_csv     │
│   already exist?"   │
│                     │
│  → Yes! Use it      │
└─────────────────────┘

What Gets Indexed

For each function in your codebase:

Name - load_csv_file
Location - src/data_loader.py:42
Parameters - (path, encoding='utf-8')
Docstring - First line for quick understanding
Type - Regular function, async function, or class method
Decorators - @staticmethod, @cached, etc.

What's NOT stored: Function bodies, comments, string literals (only structural metadata).

Search Performance

Parsing: ~100-200 files/second
Indexing: ~1000 functions/second
Search: Sub-millisecond SQLite queries
Database size: ~1 KB per function

Example: 800 functions = ~800 KB database, indexed in < 5 seconds, searched in < 1ms.

Features

🔍 Search Before Writing

Tool: search_functions(query, exact=False)

Find existing functions before Claude writes new code:

> search_functions("load csv")

Found 3 functions:

• load_csv_file(path, encoding='utf-8')
  Location: src/data_loader.py:42
  Docs: Load CSV file with proper encoding handling

• FlowDataLoader.load_flows(flow_path, site_label)
  Location: modules/flow_loader.py:98
  Docs: Load flow data from CSV with site labeling

• read_raw_csv(filepath)
  Location: legacy/importer.py:156
  Docs: Legacy CSV reader (deprecated)

Claude sees these results and chooses to import the canonical implementation instead of creating a new one.

🔁 Detect Duplicates

Tool: find_duplicates()

Find functions with the same name in multiple files:

> find_duplicates()

⚠️  Found 3 function names with multiple implementations:

**load_flow_data** (3 implementations):
  - cohesion_analyzer.py:253
  - legacy/community_detector.py:440
  - policy_group_clustering.py:497

**format_bytes** (2 implementations):
  - utils.py:88
  - helpers.py:124

💡 Recommendation: Consolidate into single canonical implementations.

Use this to audit your codebase and identify consolidation opportunities.

🎯 Similar Signatures

Tool: find_similar_signatures(min_params=2)

Find functions with the same parameters (might be doing the same thing):

> find_similar_signatures(min_params=2)

Found 2 signature groups:

**Signature: (data, output_path)** - 4 functions:
  • save_to_csv in exporter.py:67
  • write_csv_file in writer.py:134
  • export_data in utils.py:203
  • save_results in analyzer.py:445

💡 These functions likely do the same thing with different names.

Catches semantic duplicates - functions that do the same thing but have different names.

📂 Multi-Project Support

Work on multiple projects without reconfiguring:

# One-time setup
> register_project("project-a", "/Users/jose/Projects/project-a")
> register_project("project-b", "/Users/jose/Projects/project-b")

# Daily use - auto-detects from your current directory
cd ~/Projects/project-a
> search_functions("auth")
[Auto-detected: project-a]
Found 5 functions...

cd ~/Projects/project-b
> search_functions("auth")
[Auto-detected: project-b]
Found 3 functions...

Features:

✅ Register unlimited projects
✅ Auto-detection from working directory
✅ Isolated indexes (no cross-contamination)
✅ Zero configuration switching

📊 Codebase Statistics

Tool: get_index_stats()

Understand your codebase at a glance:

> get_index_stats()

📊 CodeWalker Statistics:

Total Functions: 800
Total Files: 60
Unique Names: 765
Methods: 423
Async Functions: 67
Avg Parameters: 2.3

Duplication Rate: 4.4% (35 duplicates)
Last Indexed: 2026-03-18 10:35:00

Track duplication rate over time to measure improvement.

Quick Start

1. Install

git clone https://github.com/[username]/codewalker.git
cd codewalker
pip install -r requirements.txt

2. Configure Claude Code

Add to ~/.config/claude-code/mcp.json:

{
  "mcpServers": {
    "codewalker": {
      "command": "python3",
      "args": ["/absolute/path/to/codewalker/src/server.py"]
    }
  }
}

3. Register Your Projects

Restart Claude Code, then:

> register_project("my-project", "/absolute/path/to/your/project")

🔄 Registering project: my-project
📁 Path: /absolute/path/to/your/project

⏳ Indexing project...
Found 800 functions

✅ Indexing complete!

Total Functions: 800
Total Files: 60
Unique Names: 765

4. Start Using

CodeWalker now automatically prevents duplicate code:

You: "Add functionality to load CSV files"

Claude (internal):
  > search_functions("load csv")
  Found: load_csv_file() in src/data_loader.py

Claude: "I found an existing CSV loader at src/data_loader.py:42.
Let me use that instead of creating a new one:

from src.data_loader import load_csv_file
data = load_csv_file(path)

Available Tools

Project Management

register_project(name, path) - Add a project to CodeWalker
list_projects() - View all registered projects
unregister_project(name) - Remove a project
get_current_project() - Show which project is detected

Function Search

search_functions(query, exact) - Find functions by name
find_duplicates() - Detect duplicate function names
find_similar_signatures(min_params) - Find functions with similar parameters
get_file_functions(file_path) - List all functions in a file
get_index_stats() - View codebase statistics
reindex_repository() - Rebuild index after major changes

Use Cases

1. Prevent Duplication During Development

Before every implementation:

You: "Add user authentication"

Claude: Let me check if auth code already exists...
> search_functions("auth")
Found: authenticate_user() in src/auth.py

Claude: "I found existing auth code. Let me use it..."

2. Onboard to New Codebases

Explore unfamiliar code:

> search_functions("export")
Found 12 functions with "export" in the name

> get_file_functions("src/exporter.py")
Lists all 8 functions in the file with signatures and docs

Quickly understand what exists before writing new code.

3. Refactoring and Cleanup

Find consolidation opportunities:

> find_duplicates()
Found 15 duplicate function names

> find_similar_signatures()
Found 8 signature groups (functions with same params)

Systematically eliminate duplication.

4. Code Review

Reviewers can verify reuse:

Reviewer: "Why didn't you use the existing loader?"

Developer: "Let me check..."
> search_functions("load")
Found 3 loaders I didn't know about!

Catch missed reuse opportunities during review.

Comparison: With vs Without CodeWalker

Scenario	Without CodeWalker	With CodeWalker
Add CSV loading	Creates 7th duplicate `load_csv()`	Finds and reuses existing `load_csv_file()`
Authentication needed	Creates new auth from scratch	Imports existing `authenticate_user()`
Format bytes	Creates 3rd `format_bytes()`	Uses canonical implementation
Code review	"Why is this duplicated?"	"Good reuse of existing code"
Bug in duplicates	Fix bug in 7 different places	Fix once, fixed everywhere
Onboarding	"Which loader should I use?"	Clear: one canonical implementation
Duplication rate	40-60% (typical for AI projects)	< 5% (with CodeWalker)

Graph Theory Connection

CodeWalker treats your codebase as a graph:

Vertices - Functions, classes, modules
Edges - Imports, function calls, dependencies
Walking - Traversing the graph to discover existing code

Graph concepts:

Graph walk - Sequence of vertices (functions) and edges (calls)
Traversal - Systematic exploration of the graph structure
Random walks - Discovery algorithms (like PageRank)
Tree walks - AST traversal (what the parser does)

This isn't just a metaphor - CodeWalker literally walks your Abstract Syntax Tree (AST) to build the function graph.

Roadmap & Future Development

CodeWalker v2.0.0 solves the core AI code duplication problem for Python projects. Future versions will add deeper analysis, broader language support, and smarter automation.

🔥 High Priority

Why these matter: These features provide immediate value for existing users and are most frequently requested.

Incremental indexing - Currently, reindexing rebuilds the entire database. Incremental indexing would only update changed files, making reindexing 10-100x faster for large codebases. Impact: Seconds instead of minutes for 10k+ function codebases.
Near-duplicate detection - Functions like load_csv, load_csv_data, and read_csv_file are semantically duplicates but have different names. Levenshtein distance matching would catch these "near-duplicates" that current exact/partial matching misses. Impact: Catch 20-30% more duplicates.
Cross-project search - Search across all registered projects simultaneously. Useful for teams with shared utilities across multiple repos or monorepo users who want to find reusable code anywhere. Impact: Prevent reinventing wheels across project boundaries.
Call graph analysis - Track what calls what to enable "blast radius" analysis ("what breaks if I change this function?") and identify unused code. Impact: Safer refactoring, dead code detection.

🎯 Medium Priority

Why these matter: These features enhance CodeWalker's intelligence and reduce manual effort.

Semantic similarity (ML-based) - Detect functions that do the same thing with completely different names and signatures using embedding-based similarity. Example: save_to_csv(data, path) and export_results(df, filename) might be doing the same thing. Impact: Catch duplicates current signature matching misses.
Auto-reindexing on file changes - Watch filesystem and automatically reindex when Python files change. No more manual reindex_repository() calls. Impact: Zero-maintenance index that's always current.
Multi-language support - Extend beyond Python to JavaScript, TypeScript, Go, Rust, Java. Same duplication prevention for polyglot codebases. Impact: Unified duplication prevention across entire stack.
Blast radius visualization - Show dependency trees and impact analysis when considering changes. "If I modify function X, these 15 functions are affected." Impact: Confident refactoring.

💡 Lower Priority

Why these matter: Nice-to-have features that improve developer experience but aren't critical to core functionality.

Web UI - Visual interface for browsing functions, viewing call graphs, and exploring codebase structure in a browser. Alternative to CLI-only workflow. Impact: Better onboarding experience, visual learners benefit.
VS Code extension - Native VS Code integration with inline suggestions ("⚠️ Similar function exists: use load_csv_file() instead"). Impact: Proactive duplicate prevention during typing.
Import suggestions - When Claude is about to write new code, automatically suggest existing imports. "You're about to write X, but Y already exists - import it?" Impact: Even less manual searching.
GitHub Action - CI/CD integration that fails PRs introducing duplicates above a threshold. Enforce duplication standards via automation. Impact: Prevent duplicates from ever being merged.

📊 Current Capabilities

What works today:

Language Support:

✅ Python (full support for functions, methods, async functions, decorators)
🚧 JavaScript, TypeScript, Go, Rust (on roadmap)

Analysis:

✅ Function names, signatures, locations, docstrings
✅ Parameter matching and signature comparison
✅ Duplicate detection (exact name matches)
🚧 Call graph analysis (planned)
🚧 Semantic similarity (planned)
🚧 Near-duplicate detection via Levenshtein distance (planned)

Indexing:

✅ Full repository indexing (~5 seconds for 800 functions)
✅ Manual reindexing on demand
🚧 Incremental updates (only changed files - planned)
🚧 Auto-reindexing on file changes (planned)

Search:

✅ Exact and partial name matching
✅ Parameter signature matching
✅ Multi-project support with auto-detection
🚧 Semantic search by behavior (planned)
🚧 Cross-project search (planned)

FAQ

Q: Does this work with other AI assistants?

Yes! CodeWalker uses the Model Context Protocol (MCP), which is an open standard. Any AI tool that supports MCP can use CodeWalker:

Claude Code (tested)
Claude Desktop (should work)
Other MCP-compatible tools

Q: How much overhead does indexing add?

Very little:

Initial indexing: ~5 seconds for 800 functions
Reindexing: ~5 seconds (full rebuild)
Search queries: < 1ms
Memory: ~10 MB for typical projects

You barely notice it's there.

Q: What if my codebase is huge?

CodeWalker scales well:

Tested on 800 functions / 60 files
Should handle 10,000+ functions easily (SQLite scales)
For massive codebases (100k+ functions), consider:
- Incremental indexing (planned feature)
- Multiple project registrations (already supported)
- Excluding test files or generated code

Q: Can I use this on proprietary code?

Yes! Everything is local:

✅ Index stored locally (~/.codewalker)
✅ No data sent to external services
✅ No network requests during search
✅ Your code never leaves your machine

CodeWalker is 100% private.

Q: How is this different from IDE autocomplete?

Complementary, not competing:

IDE autocomplete:

Works in single file
Shows available imports
Type-aware suggestions
Real-time as you type

CodeWalker:

Works across entire codebase
Searches by semantic intent ("load csv")
Finds duplicates proactively
Used by AI during code generation

Use both - IDE for writing, CodeWalker for AI-assisted development.

Q: What about private/internal functions?

CodeWalker indexes everything:

Public functions: ✅ Indexed
Private functions (_private): ✅ Indexed
Internal functions (__internal): ✅ Indexed

Why? Because you might want to reuse private functions too. Claude respects Python conventions (won't use _private from other modules without good reason), but knowing they exist prevents duplication.

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Areas we need help:

Multi-language support (JavaScript, TypeScript, Go)
Incremental indexing
Semantic similarity detection
Performance optimization

License

MIT License - see LICENSE for details.

Free to use in personal and commercial projects.

Credits

Built to solve a real problem: Claude Code was creating duplicate implementations across a 60-file, 800-function codebase. CodeWalker eliminated the duplication.

Inspired by: Pharaoh (commercial tool for codebase intelligence)

Built with: Claude Sonnet 4.5 (dogfooding - using AI to build tools that improve AI)

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: See guides in this repository

Summary

Problem: AI assistants can't see your codebase, causing massive code duplication.

Solution: CodeWalker indexes your codebase and lets AI search before writing.

Result: 40-60% reduction in duplicate code, faster development, cleaner codebase.

Get Started:

pip install -r requirements.txt
# Configure MCP (see Quick Start above)
> register_project("my-project", "/path/to/project")
> search_functions("whatever you're about to write")

Stop duplicating code. Start walking your codebase. 🚀

CodeWalker

CodeWalker

The Problem: AI Code Duplication

What Happens Without CodeWalker

The Cost of Code Duplication

How CodeWalker Solves This

With CodeWalker

Why This Problem Exists

LLMs Lack Architectural Awareness

The "10 Developers, 0 Communication" Problem

Real-World Impact

Case Study: Elisity Project

How It Works

Architecture

What Gets Indexed

Search Performance

Features

🔍 Search Before Writing

🔁 Detect Duplicates

🎯 Similar Signatures

📂 Multi-Project Support

📊 Codebase Statistics

Quick Start

1. Install

2. Configure Claude Code

3. Register Your Projects

4. Start Using

Available Tools

Project Management

Function Search

Use Cases

1. Prevent Duplication During Development

2. Onboard to New Codebases

3. Refactoring and Cleanup

4. Code Review

Comparison: With vs Without CodeWalker

Graph Theory Connection

Roadmap & Future Development

🔥 High Priority

🎯 Medium Priority

💡 Lower Priority

📊 Current Capabilities

FAQ

Q: Does this work with other AI assistants?

Q: How much overhead does indexing add?

Q: What if my codebase is huge?

Q: Can I use this on proprietary code?

Q: How is this different from IDE autocomplete?

Q: What about private/internal functions?

Contributing

License

Credits

Support

Summary

Reviews