CodeWalker
Walk your codebase before writing new code.
CodeWalker is an MCP server that gives Claude Code real-time access to your Python codebase structure, enabling AI-assisted development that reuses existing code instead of duplicating it.
The Problem: AI Code Duplication
What Happens Without CodeWalker
When Claude Code writes code, it can't see what already exists in your codebase. This causes a cascade of problems:
Day 1: You ask Claude to add CSV loading functionality
# Claude creates: src/data_loader.py
def load_csv_file(path):
return pd.read_csv(path)
Day 5: Different feature needs CSV loading
# Claude creates: src/importer.py (Claude has no memory of data_loader.py)
def load_csv_data(filepath):
df = pd.read_csv(filepath)
return df
Day 10: Another feature, another duplicate
# Claude creates: src/utils.py (Claude still doesn't know about the others)
def read_csv(file_path):
return pd.read_csv(file_path, low_memory=False) # Now with different behavior!
Result after 2 weeks:
- 🔴 7 different CSV loading functions across your codebase
- 🔴 Inconsistent behavior (one uses
low_memory=False, others don't) - 🔴 Impossible to maintain (bug fixes need to be applied 7 times)
- 🔴 Unpredictable behavior (which implementation gets called depends on imports)
- 🔴 Code review nightmare (reviewing duplicate implementations wastes time)
The Cost of Code Duplication
This isn't just messy - it's expensive:
| Impact | Cost |
|---|---|
| Development Time | 30-40% wasted rewriting existing code |
| Bug Fixes | Same bug appears in multiple places, fixed multiple times |
| Code Reviews | Reviewers waste time on duplicate implementations |
| Onboarding | New developers confused by inconsistent patterns |
| Technical Debt | Duplicates diverge over time, creating maintenance burden |
| Testing | Same logic tested multiple times (or worse, inconsistently) |
Real Example: A codebase with 800 functions had 52.7% duplication rate - 422 functions were duplicates. That's thousands of wasted lines of code.
How CodeWalker Solves This
CodeWalker indexes your codebase and lets Claude search before writing:
With CodeWalker
Day 1: You ask Claude to add CSV loading
Claude (internal): Let me check if CSV loading already exists...
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader. Let me use it instead of creating a new one."
Result:
# Claude imports existing function
from src.data_loader import load_csv_file
data = load_csv_file(path)
Day 5, 10, 15...: Same pattern - Claude finds and reuses existing code
Result after 2 weeks:
- ✅ 1 canonical CSV loading function (not 7)
- ✅ Consistent behavior across entire codebase
- ✅ Easy to maintain (fix bugs once, fixed everywhere)
- ✅ Predictable behavior (one implementation = one behavior)
- ✅ Fast code reviews (reviewers see reuse, not duplication)
Why This Problem Exists
LLMs Lack Architectural Awareness
Claude Code (and all LLMs) have a fundamental limitation:
❌ Can't see your codebase structure ❌ Can't search across files ❌ Can't remember what exists ❌ Can't detect duplicates
The technical reason: When Claude writes code, it only sees:
- The current file you're editing
- Recent conversation context
- Maybe a few related files you showed it
What Claude DOESN'T see:
- That
load_csv_file()already exists insrc/data_loader.py - That 3 other files have similar functions
- That your team has a canonical implementation
- Your codebase architecture and patterns
Result: Claude invents new implementations instead of reusing existing ones.
The "10 Developers, 0 Communication" Problem
Working with AI without CodeWalker is like having 10 developers who never talk to each other:
Developer 1 (Monday): Creates load_csv_file()
Developer 2 (Tuesday): Doesn't know about it, creates load_csv_data()
Developer 3 (Wednesday): Doesn't know about either, creates read_csv()
Developer 4 (Thursday): Creates import_csv()
... and so on
Each "developer" (AI session) works in isolation, creating duplicates because they can't see what others did.
CodeWalker fixes this by giving AI a "shared memory" of your entire codebase.
Real-World Impact
Case Study: Elisity Project
Before CodeWalker:
- 800 total functions
- 422 duplicates (52.7% duplication rate)
- 33 direct
pd.read_csv()calls (should use centralized loader) - 11 duplicate
print_summary()implementations - 3 duplicate
load_flow_data()functions with diverging behavior
With CodeWalker:
- Claude finds existing implementations before writing new code
- Duplication rate drops to near-zero for new code
- Codebase becomes more maintainable over time
Time Saved:
- Development: 30-40% less time rewriting existing code
- Code Review: Reviewers focus on new logic, not duplicate detection
- Bug Fixes: Fix once instead of hunting down 3-7 duplicates
How It Works
Architecture
┌─────────────────────┐
│ Your Codebase │
│ (Python files) │
└──────────┬──────────┘
│
│ AST Parser extracts
│ function metadata
▼
┌─────────────────────┐
│ SQLite Index │
│ (functions.db) │
│ │
│ • Function names │
│ • Parameters │
│ • Locations │
│ • Docstrings │
└──────────┬──────────┘
│
│ Claude queries via
│ MCP protocol
▼
┌─────────────────────┐
│ Claude Code │
│ │
│ "Does load_csv │
│ already exist?" │
│ │
│ → Yes! Use it │
└─────────────────────┘
What Gets Indexed
For each function in your codebase:
- Name -
load_csv_file - Location -
src/data_loader.py:42 - Parameters -
(path, encoding='utf-8') - Docstring - First line for quick understanding
- Type - Regular function, async function, or class method
- Decorators -
@staticmethod,@cached, etc.
What's NOT stored: Function bodies, comments, string literals (only structural metadata).
Search Performance
- Parsing: ~100-200 files/second
- Indexing: ~1000 functions/second
- Search: Sub-millisecond SQLite queries
- Database size: ~1 KB per function
Example: 800 functions = ~800 KB database, indexed in < 5 seconds, searched in < 1ms.
Features
🔍 Search Before Writing
Tool: search_functions(query, exact=False)
Find existing functions before Claude writes new code:
> search_functions("load csv")
Found 3 functions:
• load_csv_file(path, encoding='utf-8')
Location: src/data_loader.py:42
Docs: Load CSV file with proper encoding handling
• FlowDataLoader.load_flows(flow_path, site_label)
Location: modules/flow_loader.py:98
Docs: Load flow data from CSV with site labeling
• read_raw_csv(filepath)
Location: legacy/importer.py:156
Docs: Legacy CSV reader (deprecated)
Claude sees these results and chooses to import the canonical implementation instead of creating a new one.
🔁 Detect Duplicates
Tool: find_duplicates()
Find functions with the same name in multiple files:
> find_duplicates()
⚠️ Found 3 function names with multiple implementations:
**load_flow_data** (3 implementations):
- cohesion_analyzer.py:253
- legacy/community_detector.py:440
- policy_group_clustering.py:497
**format_bytes** (2 implementations):
- utils.py:88
- helpers.py:124
💡 Recommendation: Consolidate into single canonical implementations.
Use this to audit your codebase and identify consolidation opportunities.
🎯 Similar Signatures
Tool: find_similar_signatures(min_params=2)
Find functions with the same parameters (might be doing the same thing):
> find_similar_signatures(min_params=2)
Found 2 signature groups:
**Signature: (data, output_path)** - 4 functions:
• save_to_csv in exporter.py:67
• write_csv_file in writer.py:134
• export_data in utils.py:203
• save_results in analyzer.py:445
💡 These functions likely do the same thing with different names.
Catches semantic duplicates - functions that do the same thing but have different names.
📂 Multi-Project Support
Work on multiple projects without reconfiguring:
# One-time setup
> register_project("project-a", "/Users/jose/Projects/project-a")
> register_project("project-b", "/Users/jose/Projects/project-b")
# Daily use - auto-detects from your current directory
cd ~/Projects/project-a
> search_functions("auth")
[Auto-detected: project-a]
Found 5 functions...
cd ~/Projects/project-b
> search_functions("auth")
[Auto-detected: project-b]
Found 3 functions...
Features:
- ✅ Register unlimited projects
- ✅ Auto-detection from working directory
- ✅ Isolated indexes (no cross-contamination)
- ✅ Zero configuration switching
📊 Codebase Statistics
Tool: get_index_stats()
Understand your codebase at a glance:
> get_index_stats()
📊 CodeWalker Statistics:
Total Functions: 800
Total Files: 60
Unique Names: 765
Methods: 423
Async Functions: 67
Avg Parameters: 2.3
Duplication Rate: 4.4% (35 duplicates)
Last Indexed: 2026-03-18 10:35:00
Track duplication rate over time to measure improvement.
Quick Start
1. Install
git clone https://github.com/[username]/codewalker.git
cd codewalker
pip install -r requirements.txt
2. Configure Claude Code
Add to ~/.config/claude-code/mcp.json:
{
"mcpServers": {
"codewalker": {
"command": "python3",
"args": ["/absolute/path/to/codewalker/src/server.py"]
}
}
}
3. Register Your Projects
Restart Claude Code, then:
> register_project("my-project", "/absolute/path/to/your/project")
🔄 Registering project: my-project
📁 Path: /absolute/path/to/your/project
⏳ Indexing project...
Found 800 functions
✅ Indexing complete!
Total Functions: 800
Total Files: 60
Unique Names: 765
4. Start Using
CodeWalker now automatically prevents duplicate code:
You: "Add functionality to load CSV files"
Claude (internal):
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader at src/data_loader.py:42.
Let me use that instead of creating a new one:
from src.data_loader import load_csv_file
data = load_csv_file(path)
Available Tools
Project Management
register_project(name, path)- Add a project to CodeWalkerlist_projects()- View all registered projectsunregister_project(name)- Remove a projectget_current_project()- Show which project is detected
Function Search
search_functions(query, exact)- Find functions by namefind_duplicates()- Detect duplicate function namesfind_similar_signatures(min_params)- Find functions with similar parametersget_file_functions(file_path)- List all functions in a fileget_index_stats()- View codebase statisticsreindex_repository()- Rebuild index after major changes
Use Cases
1. Prevent Duplication During Development
Before every implementation:
You: "Add user authentication"
Claude: Let me check if auth code already exists...
> search_functions("auth")
Found: authenticate_user() in src/auth.py
Claude: "I found existing auth code. Let me use it..."
2. Onboard to New Codebases
Explore unfamiliar code:
> search_functions("export")
Found 12 functions with "export" in the name
> get_file_functions("src/exporter.py")
Lists all 8 functions in the file with signatures and docs
Quickly understand what exists before writing new code.
3. Refactoring and Cleanup
Find consolidation opportunities:
> find_duplicates()
Found 15 duplicate function names
> find_similar_signatures()
Found 8 signature groups (functions with same params)
Systematically eliminate duplication.
4. Code Review
Reviewers can verify reuse:
Reviewer: "Why didn't you use the existing loader?"
Developer: "Let me check..."
> search_functions("load")
Found 3 loaders I didn't know about!
Catch missed reuse opportunities during review.
Comparison: With vs Without CodeWalker
| Scenario | Without CodeWalker | With CodeWalker |
|---|---|---|
| Add CSV loading | Creates 7th duplicate load_csv() | Finds and reuses existing load_csv_file() |
| Authentication needed | Creates new auth from scratch | Imports existing authenticate_user() |
| Format bytes | Creates 3rd format_bytes() | Uses canonical implementation |
| Code review | "Why is this duplicated?" | "Good reuse of existing code" |
| Bug in duplicates | Fix bug in 7 different places | Fix once, fixed everywhere |
| Onboarding | "Which loader should I use?" | Clear: one canonical implementation |
| Duplication rate | 40-60% (typical for AI projects) | < 5% (with CodeWalker) |
Graph Theory Connection
CodeWalker treats your codebase as a graph:
- Vertices - Functions, classes, modules
- Edges - Imports, function calls, dependencies
- Walking - Traversing the graph to discover existing code
Graph concepts:
- Graph walk - Sequence of vertices (functions) and edges (calls)
- Traversal - Systematic exploration of the graph structure
- Random walks - Discovery algorithms (like PageRank)
- Tree walks - AST traversal (what the parser does)
This isn't just a metaphor - CodeWalker literally walks your Abstract Syntax Tree (AST) to build the function graph.
Roadmap & Future Development
CodeWalker v2.0.0 solves the core AI code duplication problem for Python projects. Future versions will add deeper analysis, broader language support, and smarter automation.
🔥 High Priority
Why these matter: These features provide immediate value for existing users and are most frequently requested.
-
Incremental indexing - Currently, reindexing rebuilds the entire database. Incremental indexing would only update changed files, making reindexing 10-100x faster for large codebases. Impact: Seconds instead of minutes for 10k+ function codebases.
-
Near-duplicate detection - Functions like
load_csv,load_csv_data, andread_csv_fileare semantically duplicates but have different names. Levenshtein distance matching would catch these "near-duplicates" that current exact/partial matching misses. Impact: Catch 20-30% more duplicates. -
Cross-project search - Search across all registered projects simultaneously. Useful for teams with shared utilities across multiple repos or monorepo users who want to find reusable code anywhere. Impact: Prevent reinventing wheels across project boundaries.
-
Call graph analysis - Track what calls what to enable "blast radius" analysis ("what breaks if I change this function?") and identify unused code. Impact: Safer refactoring, dead code detection.
🎯 Medium Priority
Why these matter: These features enhance CodeWalker's intelligence and reduce manual effort.
-
Semantic similarity (ML-based) - Detect functions that do the same thing with completely different names and signatures using embedding-based similarity. Example:
save_to_csv(data, path)andexport_results(df, filename)might be doing the same thing. Impact: Catch duplicates current signature matching misses. -
Auto-reindexing on file changes - Watch filesystem and automatically reindex when Python files change. No more manual
reindex_repository()calls. Impact: Zero-maintenance index that's always current. -
Multi-language support - Extend beyond Python to JavaScript, TypeScript, Go, Rust, Java. Same duplication prevention for polyglot codebases. Impact: Unified duplication prevention across entire stack.
-
Blast radius visualization - Show dependency trees and impact analysis when considering changes. "If I modify function X, these 15 functions are affected." Impact: Confident refactoring.
💡 Lower Priority
Why these matter: Nice-to-have features that improve developer experience but aren't critical to core functionality.
-
Web UI - Visual interface for browsing functions, viewing call graphs, and exploring codebase structure in a browser. Alternative to CLI-only workflow. Impact: Better onboarding experience, visual learners benefit.
-
VS Code extension - Native VS Code integration with inline suggestions ("⚠️ Similar function exists: use
load_csv_file()instead"). Impact: Proactive duplicate prevention during typing. -
Import suggestions - When Claude is about to write new code, automatically suggest existing imports. "You're about to write X, but Y already exists - import it?" Impact: Even less manual searching.
-
GitHub Action - CI/CD integration that fails PRs introducing duplicates above a threshold. Enforce duplication standards via automation. Impact: Prevent duplicates from ever being merged.
📊 Current Capabilities
What works today:
Language Support:
- ✅ Python (full support for functions, methods, async functions, decorators)
- 🚧 JavaScript, TypeScript, Go, Rust (on roadmap)
Analysis:
- ✅ Function names, signatures, locations, docstrings
- ✅ Parameter matching and signature comparison
- ✅ Duplicate detection (exact name matches)
- 🚧 Call graph analysis (planned)
- 🚧 Semantic similarity (planned)
- 🚧 Near-duplicate detection via Levenshtein distance (planned)
Indexing:
- ✅ Full repository indexing (~5 seconds for 800 functions)
- ✅ Manual reindexing on demand
- 🚧 Incremental updates (only changed files - planned)
- 🚧 Auto-reindexing on file changes (planned)
Search:
- ✅ Exact and partial name matching
- ✅ Parameter signature matching
- ✅ Multi-project support with auto-detection
- 🚧 Semantic search by behavior (planned)
- 🚧 Cross-project search (planned)
FAQ
Q: Does this work with other AI assistants?
Yes! CodeWalker uses the Model Context Protocol (MCP), which is an open standard. Any AI tool that supports MCP can use CodeWalker:
- Claude Code (tested)
- Claude Desktop (should work)
- Other MCP-compatible tools
Q: How much overhead does indexing add?
Very little:
- Initial indexing: ~5 seconds for 800 functions
- Reindexing: ~5 seconds (full rebuild)
- Search queries: < 1ms
- Memory: ~10 MB for typical projects
You barely notice it's there.
Q: What if my codebase is huge?
CodeWalker scales well:
- Tested on 800 functions / 60 files
- Should handle 10,000+ functions easily (SQLite scales)
- For massive codebases (100k+ functions), consider:
- Incremental indexing (planned feature)
- Multiple project registrations (already supported)
- Excluding test files or generated code
Q: Can I use this on proprietary code?
Yes! Everything is local:
- ✅ Index stored locally (~/.codewalker)
- ✅ No data sent to external services
- ✅ No network requests during search
- ✅ Your code never leaves your machine
CodeWalker is 100% private.
Q: How is this different from IDE autocomplete?
Complementary, not competing:
IDE autocomplete:
- Works in single file
- Shows available imports
- Type-aware suggestions
- Real-time as you type
CodeWalker:
- Works across entire codebase
- Searches by semantic intent ("load csv")
- Finds duplicates proactively
- Used by AI during code generation
Use both - IDE for writing, CodeWalker for AI-assisted development.
Q: What about private/internal functions?
CodeWalker indexes everything:
- Public functions: ✅ Indexed
- Private functions (
_private): ✅ Indexed - Internal functions (
__internal): ✅ Indexed
Why? Because you might want to reuse private functions too. Claude respects Python conventions (won't use _private from other modules without good reason), but knowing they exist prevents duplication.
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas we need help:
- Multi-language support (JavaScript, TypeScript, Go)
- Incremental indexing
- Semantic similarity detection
- Performance optimization
License
MIT License - see LICENSE for details.
Free to use in personal and commercial projects.
Credits
Built to solve a real problem: Claude Code was creating duplicate implementations across a 60-file, 800-function codebase. CodeWalker eliminated the duplication.
Inspired by: Pharaoh (commercial tool for codebase intelligence)
Built with: Claude Sonnet 4.5 (dogfooding - using AI to build tools that improve AI)
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See guides in this repository
Summary
Problem: AI assistants can't see your codebase, causing massive code duplication.
Solution: CodeWalker indexes your codebase and lets AI search before writing.
Result: 40-60% reduction in duplicate code, faster development, cleaner codebase.
Get Started:
pip install -r requirements.txt
# Configure MCP (see Quick Start above)
> register_project("my-project", "/path/to/project")
> search_functions("whatever you're about to write")
Stop duplicating code. Start walking your codebase. 🚀