PyTorch Documentation Search Tool (Project Paused)
A semantic search prototype for PyTorch documentation with command-line capabilities.
Current Status (April 19, 2025)
⚠️ This project is currently paused for significant redesign.
The tool provides a basic command-line search interface for PyTorch documentation, but it needs substantial improvement in several areas. The core embedding and search functionality works, but relevance quality is weak and MCP integration is blocked by connection timeouts.
Example Output
$ python scripts/search.py "How are multi-attention heads plotted out in PyTorch?"
Found 5 results for 'How are multi-attention heads plotted out in PyTorch?':
--- Result 1 (code) ---
Title: plot_visualization_utils.py
Source: plot_visualization_utils.py
Score: 0.3714
Snippet: # models. Let's start by analyzing the output of a Mask-RCNN model. Note that...
--- Result 2 (code) ---
Title: plot_transforms_getting_started.py
Source: plot_transforms_getting_started.py
Score: 0.3571
Snippet: https://github.com/pytorch/vision/tree/main/gallery/...
What Works
✅ Basic Semantic Search: Command-line interface for querying PyTorch documentation
✅ Vector Database: Functional ChromaDB integration for storing and querying embeddings
✅ Content Differentiation: Distinguishes between code and text content
✅ Interactive Mode: Option to run continuous interactive queries in a session
What Needs Improvement
❌ Relevance Quality: Top similarity scores of only 0.35-0.37 indicate weak matches
❌ Content Coverage: Specialized topics may have insufficient representation in the database
❌ Chunking Strategy: Current approach breaks documentation at arbitrary points
❌ Result Presentation: Snippets are too short and lack sufficient context
❌ MCP Integration: Connection timeout issues prevent Claude Code integration
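For context on the score range above, here is a minimal sketch of how a similarity score between a query embedding and a chunk embedding could be computed. It assumes the reported score is a cosine similarity, which this README does not confirm. The function name and the toy vectors are illustrative and are not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors.

    Assumption: the tool's "Score" is a cosine similarity. The README
    does not state which metric is actually used.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
query_vec = [1.0, 0.0, 0.0]
chunk_vec = [1.0, 2.0, 2.0]
print(round(cosine_similarity(query_vec, chunk_vec), 4))  # 0.3333
```

Under that assumption, a score near 0.35 means the query and the chunk point in only loosely related directions, which matches the off-topic results in the example output.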
Getting Started
Environment Setup
Create a conda environment with all dependencies:
conda env create -f environment.yml
conda activate pytorch_docs_search
API Key Setup
The tool requires an OpenAI API key for embedding generation:
export OPENAI_API_KEY=your_key_here
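A script that depends on this variable can fail early with a clear message when it is missing. The helper below is a hypothetical sketch of such a check, not code from this repository; only the variable name OPENAI_API_KEY comes from the setup step above.

```python
import os


def get_openai_api_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to test. This helper is hypothetical and not part of the project.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running a search"
        )
    return key
```

Calling get_openai_api_key() once at startup surfaces a missing key before any embedding request is attempted.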
Command-line Usage
# Search with a direct query
python scripts/search.py "your search query here"
# Run in interactive mode
python scripts/search.py --interactive
# Additional options
python scripts/search.py "query" --results 5 # Limit to 5 results
python scripts/search.py "query" --filter code # Only code results
python scripts/search.py "query" --json # Output in JSON format
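To call the tool from another Python program, the options above can be assembled into an argument list. This is a sketch under assumptions: build_search_command is a hypothetical helper, and the README does not confirm that the flags can be combined in a single call. The structure of the --json output is also not documented here, so the sketch does not parse it.

```python
def build_search_command(query=None, results=None, content_filter=None,
                         as_json=False, interactive=False):
    """Build an argv list for scripts/search.py from the documented flags.

    Hypothetical helper. Assumes --results, --filter, and --json can be
    combined, which the README does not confirm.
    """
    cmd = ["python", "scripts/search.py"]
    if interactive:
        cmd.append("--interactive")
        return cmd
    if not query:
        raise ValueError("a query is required unless interactive=True")
    cmd.append(query)
    if results is not None:
        cmd += ["--results", str(results)]
    if content_filter is not None:
        cmd += ["--filter", content_filter]
    if as_json:
        cmd.append("--json")
    return cmd


print(build_search_command("query", results=5, as_json=True))
# ['python', 'scripts/search.py', 'query', '--results', '5', '--json']
```

The resulting list can be passed to subprocess.run from the repository root, with capture_output=True and text=True to collect the output.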
Project Architecture
- ptsearch/core/: Core search functionality (database, embedding, search)
- ptsearch/config/: Configuration management
- ptsearch/utils/: Utility functions and logging
- scripts/: Command-line tools
- data/: Embedded documentation and database
- ptsearch/protocol/: MCP protocol handling (currently unused)
- ptsearch/transport/: Transport implementations (STDIO, SSE) (currently unused)
Why This Project Is Paused
After evaluating the current implementation, we've identified several challenges that require significant redesign:
- Data Quality Issues: The current embedding approach doesn't capture semantic relationships between PyTorch concepts effectively enough. Relevance scores around 0.35-0.37 are too low for a quality user experience.
- Chunking Limitations: Our current method divides documentation into chunks based on character count rather than conceptual boundaries, leading to fragmented results.
- MCP Integration Problems: Despite multiple implementation approaches, we encountered persistent timeout issues when attempting to integrate with Claude Code:
  - STDIO integration failed at connection establishment
  - Flask server with SSE transport couldn't maintain stable connections
  - UVX deployment experienced similar timeout issues
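The chunking limitation can be illustrated with a small comparison. The sketch below contrasts fixed-size character chunking with paragraph-boundary chunking, which is a simple stand-in for chunking on conceptual boundaries. Neither function is the project's actual implementation; the function names, size limits, and sample text are assumptions made for illustration.

```python
def chunk_by_chars(text, size):
    """Fixed-size chunking: cuts wherever the character budget runs out."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraphs(text, max_chars):
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A paragraph longer than max_chars becomes its own oversized chunk
    instead of being cut in the middle.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Sample text made up for this illustration.
doc = (
    "Tensors are multi-dimensional arrays.\n\n"
    "Autograd records operations on tensors.\n\n"
    "Modules group parameters and layers."
)

for chunk in chunk_by_chars(doc, 50):
    print(repr(chunk))       # boundaries fall in the middle of sentences
for chunk in chunk_by_paragraphs(doc, 90):
    print(repr(chunk))       # every paragraph stays intact
```

Paragraph boundaries are only a rough proxy for concepts. Splitting on section headings or function-level code blocks would follow the same packing pattern with a different splitter.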
Future Roadmap
When development resumes, we plan to focus on:
- Improved Chunking Strategy: Implement semantic chunking that preserves conceptual boundaries
- Enhanced Result Formatting: Provide more context and better snippet selection
- Expanded Documentation Coverage: Ensure comprehensive representation of all PyTorch topics
- MCP Integration Redesign: Work with the Claude team to resolve timeout issues
Development
Running Tests
pytest -v tests/
Format Code
black .
License
MIT License