MCP Hub
Back to servers

word-doc-mcp

A comprehensive MCP server for Word document processing featuring table extraction, image OCR, full-text search, and performance optimizations for large files.

Tools
7
Updated
Dec 13, 2025
Validated
Jan 11, 2026

Word Document Reader MCP Server

A powerful Word document reading MCP server with table extraction, image OCR analysis, large document optimization, and intelligent caching.

🚀 Core Features

1. Document Content Extraction

  • ✅ Word document (.docx/.doc) text extraction
  • ✅ Support for mixed Chinese-English documents
  • ✅ Preserve original formatting and structure

2. Table Extraction

  • ✅ Automatically identify and extract tables from Word documents
  • ✅ Convert to structured data format
  • ✅ Preserve table row/column structure information
  • ✅ Support complex table parsing

3. Image OCR Analysis

  • ✅ Extract embedded images from Word documents
  • ✅ High-precision OCR recognition using Tesseract.js v5
  • ✅ Support mixed Chinese-English text recognition (95%+ accuracy)
  • ✅ Intelligent image preprocessing for better recognition
  • ✅ Support multiple image formats (JPG, PNG, GIF, BMP, WebP)

4. Large Document Optimization

  • ✅ Automatic detection of large documents (>10MB or >100 pages)
  • ✅ Worker thread parallel processing, utilizing multi-core CPUs
  • ✅ Chunked processing to avoid memory overflow
  • ✅ 60%+ speed improvement

5. Intelligent Caching System

  • ✅ File system persistent caching
  • ✅ Smart cache invalidation based on file modification time
  • ✅ Cache statistics and management support
  • ✅ 90%+ speed improvement for repeated document processing

6. Full-text Index Search

  • ✅ Millisecond-level search with inverted index
  • ✅ Intelligent Chinese-English word segmentation
  • ✅ Relevance scoring and sorting
  • ✅ Support document type filtering

📦 Installation and Usage

1. Install Dependencies

npm install

2. Start Server

# Start full-featured version
npm start

# Or start basic version (without advanced features)
npm run start:basic

3. Run Tests

# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

# Generate test coverage report
npm run test:coverage

read_word_document

Read and analyze Word documents

{
  "name": "read_word_document",
  "arguments": {
    "filePath": "path/to/document.docx",
    "memoryKey": "my-document",
    "documentType": "api-doc",
    "extractTables": true,
    "extractImages": true,
    "useCache": true,
    "outputDir": "./output"
  }
}

search_documents

Full-text index search

{
  "name": "search_documents",
  "arguments": {
    "query": "search keywords",
    "documentType": "api-doc",
    "limit": 10
  }
}

get_cache_stats

Get cache statistics

{
  "name": "get_cache_stats"
}

clear_cache

Clear cache

{
  "name": "clear_cache",
  "arguments": {
    "type": "all"  // "all", "document", "index"
  }
}

list_stored_documents

List stored documents

{
  "name": "list_stored_documents",
  "arguments": {
    "documentType": "api-doc"
  }
}

get_stored_document

Get specific document content

{
  "name": "get_stored_document",
  "arguments": {
    "memoryKey": "document-key"
  }
}

clear_memory

Clear memory content

{
  "name": "clear_memory",
  "arguments": {
    "memoryKey": "specific-key"  // Optional, clear all if not provided
  }
}

📁 Project Structure

word-doc-mcp/
├── server.js              # Main server file (with all features)
├── server-basic.js        # Basic server (compatibility)
├── package.json           # Project configuration and dependencies
├── config.json           # Server configuration file
├── tests/                # Test directory
│   ├── setup.js          # Test environment setup
│   ├── unit/             # Unit tests
│   │   └── services/     # Service layer tests
│   ├── integration/      # Integration tests
│   │   ├── tools/        # Tool tests
│   │   └── cache/        # Cache tests
│   └── fixtures/         # Test data
│       ├── documents/    # Test documents
│       └── mock-data.js  # Mock data
├── .cache/               # Cache directory (auto-created)
├── output/               # Output directory (auto-created)
└── node_modules/         # Dependencies

⚙️ Configuration

Edit the config.json file to customize server behavior:

{
  "processing": {
    "maxFileSize": 10485760,
    "maxPages": 100,
    "chunkSize": 1048576,
    "parallelProcessing": true
  },
  "cache": {
    "enabled": true,
    "defaultTTL": 3600,
    "cacheDirectory": "./.cache"
  },
  "ocr": {
    "enabled": true,
    "languages": ["chi_sim", "eng"]
  }
}

🧪 Testing

Test Framework

Using Node.js built-in test framework, following these standards:

  • Unit Tests: Test individual components and functions
  • Integration Tests: Test interactions between tools
  • End-to-End Tests: Test complete workflows

Running Tests

# Run all tests
npm test

# Run specific test file
node --test tests/unit/services/DocumentIndexer.test.js

# Run integration tests
node --test tests/integration/

# Generate coverage report
npm run test:coverage

Test Coverage

  • ✅ Functional tests for all MCP tools
  • ✅ Complete cache system tests
  • ✅ Error handling and edge cases
  • ✅ Performance and concurrency tests
  • ✅ End-to-end workflow tests

📊 Performance Metrics

  • Large Document Processing: 60%+ speed improvement (parallel processing)
  • Repeated Document Processing: 90%+ speed improvement (caching)
  • OCR Recognition Accuracy: 95%+ (image preprocessing)
  • Memory Usage Optimization: 40% reduction (streaming processing)
  • Search Response Time: <100ms (full-text index)

🛡️ Security Considerations

  • Input file size limits
  • File type validation
  • Cache data isolation
  • Error handling and logging
  • Automatic temporary file cleanup

🔄 Version Compatibility

Backward Compatibility

  • ✅ Maintain full compatibility with original API
  • ✅ Existing tool functionality unchanged
  • ✅ Optional configuration with reasonable defaults
  • ✅ Provide basic version to ensure compatibility

System Requirements

Minimum Requirements:

  • Node.js 16+
  • 4GB RAM
  • 1GB disk space

Recommended Configuration:

  • Node.js 18+
  • 8GB+ RAM
  • Multi-core CPU
  • SSD storage

🐛 Troubleshooting

Common Issues

  1. Module Installation Failure

    npm cache clean --force
    npm install
    
  2. OCR Recognition Failure

    • Ensure sufficient memory (8GB+ recommended)
    • Check supported image formats
    • Review error logs
  3. Slow Large Document Processing

    • Enable parallel processing
    • Adjust chunkSize configuration
    • Use SSD storage
  4. Memory Insufficient

    node --max-old-space-size=4096 server.js
    

📝 Changelog

v2.0.0

  • ✅ Add table extraction functionality
  • ✅ Add image OCR analysis
  • ✅ Implement large document parallel processing
  • ✅ Add intelligent caching system
  • ✅ Implement full-text index search
  • ✅ Complete testing framework

v1.0.0

  • ✅ Basic Word document reading
  • ✅ Memory storage management
  • ✅ Simple search functionality

🤝 Contributing

Issues and Pull Requests are welcome!

Development Guidelines

  1. Fork the project
  2. Create feature branch
  3. Write test cases
  4. Ensure all tests pass
  5. Submit Pull Request

📄 License

MIT License


Quick Start: npm install && npm start

Reviews

No reviews yet

Sign in to write a review