FIS Recommender MCP Server

An MCP (Model Context Protocol) server that analyzes DevOps Agent findings and automatically recommends AWS Fault Injection Simulator (FIS) experiments, generating complete experiment templates. It helps teams quickly design chaos engineering experiments to validate system resilience by mapping reported issues to specific fault injection actions across AWS services.

Updated
Feb 13, 2026

Features

  • 🔍 Analyzes DevOps findings and suggests relevant FIS experiments
  • 🎯 Maps issues to appropriate fault injection actions
  • 📋 Generates complete FIS experiment templates
  • ⚡ Integrates seamlessly with Kiro CLI and other MCP clients

Installation

Clone the Repository

git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
chmod +x server.py

Configure MCP Client

For Kiro CLI

Add to ~/.kiro/mcp-servers.json:

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

For Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

Usage Examples

Example 1: Network Latency Issue

Prompt:

I have a DevOps finding about network latency causing timeouts in my application. 
Can you recommend FIS experiments to test this?

Finding details:
- ID: finding-001
- Summary: "High network latency between services causing request timeouts"
- Type: NETWORK_ISSUE

Response: The MCP server will recommend:

  • Action: aws:network:disrupt-connectivity
  • Duration: 10 minutes
  • Target: Network interfaces
  • Stop condition: CloudWatch alarm on error rate

Example 2: Database Availability

Prompt:

Recommend FIS experiments for this finding:
{
  "id": "finding-db-001",
  "summary": "Database connection failures during peak load",
  "type": "DATABASE_ISSUE"
}

Response:

  • Action: aws:rds:reboot-db-instances
  • Duration: 2 minutes
  • Target: RDS instances
  • Tests application's database failover handling

Example 3: CPU Stress Testing

Prompt:

We had a CPU spike incident. Generate a FIS template to test our auto-scaling.

Finding: "CPU utilization reached 95% causing service degradation"

Response: Complete FIS experiment template with:

  • EC2 instance stop action
  • 3-minute duration
  • CloudWatch alarm stop condition
  • Target selection by tags

Example 4: Memory Pressure

Prompt:

Create FIS experiments to validate our memory monitoring:
- Finding ID: mem-leak-001
- Issue: Memory leak caused OOM errors
- Need to test alerting and recovery

Response:

  • Action: aws:ssm:send-command (memory stress)
  • Duration: 5 minutes
  • SSM document for memory consumption
  • Tests monitoring and auto-recovery

Standalone Testing

Run the example script to test without an MCP client:

python3 example.py

This will analyze sample findings and display recommendations.

Supported Finding Types

Network & Connectivity

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| network | aws:network:disrupt-connectivity | 5 min | Test network partition handling |
| latency | aws:network:disrupt-connectivity | 10 min | Validate timeout configurations |
| packet loss | aws:ecs:task-network-packet-loss | 5 min | Simulate packet loss scenarios |
| vpc endpoint | aws:network:disrupt-vpc-endpoint | 5 min | Test VPC endpoint failures |
| cross-region | aws:network:route-table-disrupt-cross-region-connectivity | 10 min | Test multi-region connectivity |
| transit gateway | aws:network:transit-gateway-disrupt-cross-region-connectivity | 10 min | Test transit gateway issues |
| direct connect | aws:directconnect:virtual-interface-disconnect | 5 min | Test Direct Connect failures |

Database & Storage

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| database | aws:rds:reboot-db-instances | 2 min | Test database failover |
| rds | aws:rds:failover-db-cluster | 3 min | Test RDS cluster failover |
| dynamodb | aws:dynamodb:global-table-pause-replication | 5 min | Test DynamoDB replication pause |
| aurora dsql | aws:dsql:cluster-connection-failure | 5 min | Test Aurora DSQL failures |
| disk | aws:ebs:pause-volume-io | 3 min | Test disk I/O failures |
| ebs | aws:ebs:volume-io-latency | 5 min | Inject EBS I/O latency |
| s3 replication | aws:s3:bucket-pause-replication | 10 min | Test S3 replication pause |

Compute & Instances

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| cpu | aws:ec2:stop-instances | 3 min | Validate auto-scaling policies |
| memory | aws:ssm:send-command | 5 min | Test OOM handling |
| instance | aws:ec2:reboot-instances | 2 min | Test instance reboot resilience |
| spot | aws:ec2:send-spot-instance-interruptions | 2 min | Test spot interruption handling |
| capacity | aws:ec2:api-insufficient-instance-capacity-error | 5 min | Test capacity error handling |
| auto scaling | aws:ec2:asg-insufficient-instance-capacity-error | 5 min | Test ASG capacity errors |

ECS & Containers

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| ecs | aws:ecs:stop-task | 2 min | Test ECS task failure recovery |
| container cpu | aws:ecs:task-cpu-stress | 5 min | Inject CPU stress on tasks |
| container memory | aws:ecs:task-io-stress | 5 min | Inject I/O stress on tasks |
| container network | aws:ecs:task-network-latency | 5 min | Inject network latency on tasks |
| drain | aws:ecs:drain-container-instances | 5 min | Test container draining |

EKS & Kubernetes

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| eks | aws:eks:pod-delete | 2 min | Test pod deletion recovery |
| pod cpu | aws:eks:pod-cpu-stress | 5 min | Inject CPU stress on pods |
| pod memory | aws:eks:pod-memory-stress | 5 min | Inject memory stress on pods |
| pod network | aws:eks:pod-network-latency | 5 min | Inject network latency on pods |
| nodegroup | aws:eks:terminate-nodegroup-instances | 3 min | Test node termination |
| kubernetes | aws:eks:inject-kubernetes-custom-resource | 5 min | Inject custom K8s faults |

Lambda & Serverless

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| lambda | aws:lambda:invocation-error | 5 min | Inject Lambda errors |
| lambda latency | aws:lambda:invocation-add-delay | 5 min | Add Lambda invocation delay |
| lambda http | aws:lambda:invocation-http-integration-response | 5 min | Test Lambda HTTP failures |

Lambda Chaos Engineering Best Practices

Testing Cold Starts and Timeouts:

  • Use aws:lambda:invocation-add-delay to simulate cold start scenarios
  • Set startupDelayMilliseconds higher than function timeout to test timeout handling
  • Validates retry logic, dead letter queues, and error handling

Error Handling Validation:

  • Use aws:lambda:invocation-error with preventExecution: true to test without running code
  • Set invocationPercentage to gradually increase fault injection (start at 10-20%)
  • Verify CloudWatch alarms fire and monitoring captures errors
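As a minimal sketch, the action entry in a FIS experiment template for this pattern might look like the following. The parameter names (`preventExecution`, `invocationPercentage`) follow the description above, and the target key name is a placeholder; verify exact keys and casing against the AWS FIS action reference before deploying.

```python
# Hypothetical FIS action entry for gradually ramped Lambda error injection.
lambda_error_action = {
    "actionId": "aws:lambda:invocation-error",
    "parameters": {
        "duration": "PT5M",
        "invocationPercentage": "10",  # start small (10-20%), increase gradually
        "preventExecution": "true",    # exercise error paths without running code
    },
    "targets": {"Functions": "targetFunctions"},  # placeholder target reference
}
```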

Integration Testing:

  • Use aws:lambda:invocation-http-integration-response for ALB, API Gateway, VPC Lattice
  • Test upstream/downstream service behavior with custom HTTP status codes
  • Validate circuit breakers and fallback mechanisms

Continuous Testing in CI/CD:

  • Automate Lambda FIS experiments in AWS CodePipeline post-deployment
  • Use CloudWatch Synthetics to monitor user experience during experiments
  • Set stop conditions based on error rate thresholds (e.g., >5% errors)

Experiment Safety:

  • Start experiments in non-production with synthetic traffic
  • Use invocationPercentage parameter to limit blast radius
  • Configure CloudWatch alarms as stop conditions
  • Run during off-peak hours initially

Key Metrics to Monitor:

  • Invocation errors and throttles
  • Duration and billed duration
  • Concurrent executions
  • Dead letter queue messages
  • Downstream service health

Caching & Streaming

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| elasticache | aws:elasticache:replicationgroup-interrupt-az-power | 5 min | Test ElastiCache AZ failure |
| memorydb | aws:memorydb:multi-region-cluster-pause-replication | 5 min | Test MemoryDB replication |
| kinesis | aws:kinesis:stream-provisioned-throughput-exception | 5 min | Test Kinesis throughput |
| kinesis iterator | aws:kinesis:stream-expired-iterator-exception | 3 min | Test expired iterator handling |

API & Throttling

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| api throttle | aws:fis:inject-api-throttle-error | 5 min | Inject API throttling |
| api error | aws:fis:inject-api-internal-error | 5 min | Inject API internal errors |
| api unavailable | aws:fis:inject-api-unavailable-error | 5 min | Inject API unavailable errors |

Availability & Recovery

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| availability | aws:ec2:stop-instances | 5 min | Test high availability setup |
| zonal | aws:arc:start-zonal-autoshift | 10 min | Test zonal autoshift |
| alarm | aws:cloudwatch:assert-alarm-state | 1 min | Validate alarm states |

Available Tools

1. recommend_fis_experiments

Analyzes DevOps Agent findings and returns FIS experiment recommendations.

Input:

{
  "finding": {
    "id": "finding-123",
    "summary": "Network latency caused timeouts",
    "type": "AVAILABILITY_ISSUE"
  }
}

Output:

{
  "recommendations": [
    {
      "action": "aws:network:disrupt-connectivity",
      "duration": "PT10M",
      "description": "Simulates network disruption to test timeout handling",
      "targets": ["NetworkInterface"],
      "stopConditions": ["CloudWatch alarm on error rate > 5%"]
    }
  ],
  "finding_id": "finding-123",
  "count": 1
}
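The recommendation step can be pictured as a keyword scan over the finding summary. This is an illustrative sketch, not the server's actual code; the keyword tables above drive the real mappings, and only two example entries are shown here.

```python
# Minimal sketch of keyword-based recommendation (illustrative mappings only).
FINDING_MAPPINGS = {
    "latency": {
        "action": "aws:network:disrupt-connectivity",
        "duration": "PT10M",
        "description": "Simulates network disruption to test timeout handling",
    },
    "database": {
        "action": "aws:rds:reboot-db-instances",
        "duration": "PT2M",
        "description": "Tests database failover handling",
    },
}

def recommend_fis_experiments(finding):
    """Return recommendations whose keyword appears in the finding summary."""
    summary = finding.get("summary", "").lower()
    recs = [rec for keyword, rec in FINDING_MAPPINGS.items() if keyword in summary]
    return {
        "recommendations": recs,
        "finding_id": finding.get("id"),
        "count": len(recs),
    }

result = recommend_fis_experiments(
    {"id": "finding-123", "summary": "Network latency caused timeouts"}
)
```

With the finding above, only the "latency" mapping matches, producing the single recommendation shown in the example output.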

2. create_fis_template

Generates a complete, ready-to-deploy FIS experiment template.

Input:

{
  "recommendation": {
    "action": "aws:ec2:stop-instances",
    "duration": "PT3M",
    "description": "Test instance failure recovery"
  },
  "target_config": {
    "resourceType": "aws:ec2:instance",
    "selectionMode": "COUNT(1)",
    "tags": {
      "Environment": "staging",
      "Team": "platform"
    },
    "roleArn": "arn:aws:iam::123456789012:role/FISRole"
  }
}

Output: Complete CloudFormation-compatible FIS experiment template ready for deployment.
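Assembling the template from a recommendation plus target configuration can be sketched as below. Field names follow the general shape of an AWS FIS experiment template (actions, targets, stopConditions, roleArn), but the target key names and action parameters here are placeholders; check them against the FIS CreateExperimentTemplate reference for the specific action before deploying.

```python
def create_fis_template(recommendation, target_config):
    """Sketch: build a FIS experiment template dict from a recommendation
    and a target configuration. Verify field names against the FIS API."""
    return {
        "description": recommendation["description"],
        "roleArn": target_config["roleArn"],
        "targets": {
            "Targets-1": {
                "resourceType": target_config["resourceType"],
                "selectionMode": target_config["selectionMode"],
                "resourceTags": target_config.get("tags", {}),
            }
        },
        "actions": {
            "Action-1": {
                "actionId": recommendation["action"],
                "parameters": {"duration": recommendation["duration"]},
                "targets": {"Instances": "Targets-1"},
            }
        },
        "stopConditions": [{"source": "none"}],  # replace with a CloudWatch alarm
    }

template = create_fis_template(
    {"action": "aws:ec2:stop-instances", "duration": "PT3M",
     "description": "Test instance failure recovery"},
    {"resourceType": "aws:ec2:instance", "selectionMode": "COUNT(1)",
     "tags": {"Environment": "staging", "Team": "platform"},
     "roleArn": "arn:aws:iam::123456789012:role/FISRole"},
)
```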

Customization

Adding New Finding Mappings

Edit server.py and add to the finding_mappings dictionary:

finding_mappings = {
    "disk": {
        "action": "aws:ebs:pause-volume-io",
        "duration": "PT5M",
        "description": "Simulates disk I/O issues"
    },
    # Add your custom mappings here
}

Adjusting Durations

Modify duration values in ISO 8601 format:

  • PT2M = 2 minutes
  • PT5M = 5 minutes
  • PT10M = 10 minutes
  • PT1H = 1 hour
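If you generate durations programmatically, a small helper (a sketch, covering only whole minutes and hours) keeps them in the ISO 8601 form shown above:

```python
def iso8601_duration(minutes):
    """Convert a whole-minute count to the ISO 8601 duration form FIS uses,
    e.g. 5 -> PT5M, 60 -> PT1H. Handles whole minutes and hours only."""
    if minutes % 60 == 0:
        return f"PT{minutes // 60}H"
    return f"PT{minutes}M"
```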

Requirements

  • Python 3.7+
  • AWS credentials configured (for actual FIS deployment)
  • MCP-compatible client (Kiro CLI, Claude Desktop, etc.)

Chaos Engineering Best Practices

The Chaos Engineering Flywheel

Follow the scientific method for each experiment:

  1. Define Steady State - Establish measurable baseline metrics (TPS, latency, error rate)
  2. Form Hypothesis - Predict how the system will respond to the fault
  3. Run Experiment - Inject the fault in a controlled manner
  4. Verify Results - Compare actual behavior against hypothesis
  5. Improve - Address gaps and re-run experiments
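Step 4 of the flywheel can be automated as a steady-state check that compares observed metrics against the baseline hypothesis. The thresholds below are illustrative defaults, not prescriptions:

```python
def verify_steady_state(baseline, observed,
                        max_error_rate=0.05, max_latency_growth=1.5):
    """Compare observed metrics against the hypothesis: error rate stays
    below a ceiling and p99 latency within a multiple of baseline."""
    if observed["error_rate"] > max_error_rate:
        return False
    return observed["p99_latency_ms"] <= baseline["p99_latency_ms"] * max_latency_growth

healthy = verify_steady_state(
    {"p99_latency_ms": 200.0},
    {"error_rate": 0.01, "p99_latency_ms": 250.0},
)
```

A failed check means the hypothesis was wrong: address the gap, then re-run the experiment.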

Experiment Safety Guidelines

Start Small, Scale Gradually:

  • Begin in non-production environments
  • Use synthetic traffic before real customer traffic
  • Start with low percentages (10-20%) and increase gradually
  • Run during off-peak hours initially

Implement Guardrails:

  • Set CloudWatch alarms as stop conditions
  • Define clear rollback procedures
  • Monitor blast radius with real-time dashboards
  • Communicate with operations teams before experiments

Scope and Impact:

  • Clearly define experiment boundaries
  • Use tags to target specific resources
  • Limit concurrent experiments
  • Document expected vs. actual impact

Continuous Chaos Testing

Automate in CI/CD:

  • Integrate FIS experiments into AWS CodePipeline
  • Run experiments post-deployment automatically
  • Use results to gate production releases
  • Track experiment results over time

Game Days:

  • Schedule regular chaos engineering sessions
  • Simulate realistic failure scenarios
  • Test incident response procedures
  • Validate runbooks and documentation

Key Metrics to Track

System Health:

  • Request success rate (target: >99.9%)
  • Latency percentiles (p50, p95, p99)
  • Error rates (4xx, 5xx)
  • Resource utilization (CPU, memory, connections)
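The latency percentiles above can be computed from raw samples with the standard library (a sketch; `statistics.quantiles` requires Python 3.8+):

```python
import statistics  # statistics.quantiles requires Python 3.8+

def latency_percentiles(samples_ms):
    """Compute the p50/p95/p99 latency percentiles from raw samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Illustrative samples: latencies of 1..100 ms.
pcts = latency_percentiles([float(ms) for ms in range(1, 101)])
```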

Resilience Indicators:

  • Time to detect failures
  • Time to recovery
  • Blast radius of failures
  • Cascading failure prevention

Common Failure Scenarios

Network Failures:

  • Partition tolerance between services
  • Cross-region connectivity loss
  • DNS resolution failures
  • Increased latency and packet loss

Resource Exhaustion:

  • CPU and memory pressure
  • Connection pool exhaustion
  • Disk I/O saturation
  • API throttling and rate limits

Dependency Failures:

  • Database failover and replication lag
  • Cache invalidation and cold starts
  • Third-party API unavailability
  • Message queue backlogs

License

MIT

Contributing

Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp
