04 — Failover
Prerequisite: Recipe 03 (Circuit Breaker)
You will need: Two running MCP servers (primary + backup)
Time: 15 minutes
Adds: Automatic failover to backup MCP server with priority-based routing
The Problem
The circuit breaker from recipe 03 saved you from wasting 30 seconds per failed call. But the agent still got nothing. Zero results. The circuit opened, requests failed fast, and your agent couldn't complete its task. Protection is great, but errors are still errors.
Your single MCP server is one crash away from downtime. What if a second MCP server were ready to answer while the primary recovers?
Prerequisites
You need TWO running MCP servers. Use the in-repo test server on different ports:
# Build the test MCP server (skip if already built in recipe 01)
docker build -t mcp-math:latest examples/provider_math/
# Terminal 1: Primary server on port 8080
docker run -d --name mcp-primary -p 8080:8080 mcp-math:latest
# Terminal 2: Backup server on port 8081
docker run -d --name mcp-backup -p 8081:8080 -e MCP_PORT=8080 mcp-math:latest
Keep both containers running.
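Before moving on, it can help to confirm that both published ports accept connections. The sketch below is a hypothetical helper, not part of Hangar: `port_open` is a name introduced here, and a successful TCP connect only shows that the port is open, not that the MCP endpoint behind it is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Ports taken from the docker run commands above.
    for name, port in (("mcp-primary", 8080), ("mcp-backup", 8081)):
        state = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name} on port {port}: {state}")
```

If either line reports NOT reachable, `docker ps` shows whether the container is actually running.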
The Config
# config.yaml - Recipe 04: Failover
mcp_servers:
  my-mcp:
    mode: remote
    endpoint: http://localhost:8080/mcp
    description: "Primary MCP server"
    health_check_interval_s: 30
    max_consecutive_failures: 3
    http:
      connect_timeout: 10.0
      read_timeout: 30.0

  my-mcp-backup:                          # NEW: added in this recipe
    mode: remote                          # NEW: added in this recipe
    endpoint: http://localhost:8081/mcp   # NEW: added in this recipe
    description: "Backup MCP server"      # NEW: added in this recipe
    health_check_interval_s: 30           # NEW: added in this recipe
    max_consecutive_failures: 3           # NEW: added in this recipe
    http:                                 # NEW: added in this recipe
      connect_timeout: 10.0               # NEW: added in this recipe
      read_timeout: 30.0                  # NEW: added in this recipe

  my-mcp-group:
    mode: group
    description: "Primary/backup MCP failover"
    strategy: priority                    # NEW: changed from round_robin
    min_healthy: 1
    circuit_breaker:
      failure_threshold: 3
      reset_timeout_s: 30
    members:
      - id: my-mcp                        # NEW: added priority
        priority: 1                       # NEW: added priority (primary)
      - id: my-mcp-backup                 # NEW: added backup member
        priority: 2                       # NEW: backup has lower priority
Save this as ~/.config/mcp-hangar/config.yaml (or update your existing file).
Try It
- Start Hangar with the new config

      mcp-hangar --config ~/.config/mcp-hangar/config.yaml serve \
        --log-file /tmp/hangar-failover.log &

  Expected log output:

      INFO group_created group_id=my-mcp-group strategy=priority
      INFO Added member my-mcp to group my-mcp-group (priority=1)
      INFO Added member my-mcp-backup to group my-mcp-group (priority=2)

- Check group status - both members healthy

      tail -20 /tmp/hangar-failover.log | grep -E "member|health|rotation"

  (log output showing both members initialized and group ready)

  Both MCP servers in rotation. Primary (priority 1) will handle requests.
- Call a tool through the group - succeeds via primary

      (
        echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}'
        sleep 0.5
        echo '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}'
        sleep 0.5
        echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"hangar_call","arguments":{"calls":[{"mcp_server":"my-mcp-group","tool":"fetch","arguments":{"url":"https://example.com"}}]}},"id":2}'
        sleep 3
      ) | mcp-hangar --config ~/.config/mcp-hangar/config.yaml serve 2>&1 | grep -E '"id":2|selected_member'

  Expected output:

      {"jsonrpc":"2.0","id":2,"result":{"content":[{"type":"text","text":"..."}]}}

  Call succeeded. Traffic routed to primary (priority 1).
- Kill the primary server

      docker stop mcp-primary

  Primary is now dead. Backup still running.
- Wait for health check to detect failure

      echo "Waiting 40 seconds for health detection..."
      sleep 40
      tail -10 /tmp/hangar-failover.log | grep -E "health|rotation|degraded"

  Expected log output:

      WARNING health_check_failed mcp_server=my-mcp consecutive_failures=1
      WARNING health_check_failed mcp_server=my-mcp consecutive_failures=2
      WARNING health_check_failed mcp_server=my-mcp consecutive_failures=3

  (primary removed from rotation after max consecutive failures)

  Primary out of rotation. Backup takes over. If fewer than three failures appear, wait and run the tail command again: with health_check_interval_s: 30 and max_consecutive_failures: 3, detection can take up to roughly 90 seconds.
- Call the same tool - succeeds via backup

      (
        echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}'
        sleep 0.5
        echo '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}'
        sleep 0.5
        echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"hangar_call","arguments":{"calls":[{"mcp_server":"my-mcp-group","tool":"fetch","arguments":{"url":"https://example.com"}}]}},"id":2}'
        sleep 3
      ) | mcp-hangar --config ~/.config/mcp-hangar/config.yaml serve 2>&1 | grep -E '"id":2|selected_member'

  Expected output:

      {"jsonrpc":"2.0","id":2,"result":{"content":[{"type":"text","text":"..."}]}}

  Call succeeded. Same request, same result, different MCP server. Zero downtime.
- Restart primary and verify failback

      # Restart primary
      docker start mcp-primary

      # Wait for recovery
      echo "Waiting 40 seconds for primary recovery..."
      sleep 40

      # Check logs
      tail -10 /tmp/hangar-failover.log | grep -E "health|rotation|ready"

  (log output showing primary health check passed and MCP server returning to ready state)

  Primary recovered and back in rotation. Will reclaim traffic (priority 1 < priority 2).
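The JSON-RPC payloads in the two tool-call steps are long enough that hand-editing them is error-prone. As a rough sketch, the three messages could be generated instead. `build_session` is a hypothetical helper introduced here; the field values are copied from the `echo` commands above.

```python
import json


def build_session(group: str, tool: str, arguments: dict) -> list[str]:
    """Return the three JSON-RPC messages from the steps above, one string each."""
    initialize = {
        "jsonrpc": "2.0",
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "test", "version": "1.0"},
        },
        "id": 1,
    }
    # Notifications carry no "id" because no response is expected.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized", "params": {}}
    call = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {
            "name": "hangar_call",
            "arguments": {
                "calls": [{"mcp_server": group, "tool": tool, "arguments": arguments}]
            },
        },
        "id": 2,
    }
    # Compact separators keep each message on a single line.
    return [json.dumps(m, separators=(",", ":")) for m in (initialize, initialized, call)]


for line in build_session("my-mcp-group", "fetch", {"url": "https://example.com"}):
    print(line)
```

The shell version spaces the messages with `sleep` calls before piping them into `mcp-hangar`; this sketch only produces the payloads and does not reproduce that timing.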
What Just Happened
You introduced MCP server groups with priority-based routing for automatic failover. The group contains two MCP servers: my-mcp (priority 1, primary) and my-mcp-backup (priority 2, backup).
Priority strategy mechanics:
The priority load balancing strategy always routes traffic to the lowest-numbered healthy member in rotation. Priority 1 is highest priority (primary). If priority 1 becomes unhealthy, traffic automatically fails over to priority 2 (backup). When priority 1 recovers, it reclaims traffic (failback).
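The selection rule can be illustrated in a few lines. This is a minimal sketch of the behavior described above, not Hangar's actual implementation; `Member` and `select_member` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    id: str
    priority: int
    healthy: bool = True


def select_member(members: list[Member]) -> Optional[Member]:
    """Pick the healthy member with the lowest priority number, or None if none is healthy."""
    candidates = [m for m in members if m.healthy]
    return min(candidates, key=lambda m: m.priority, default=None)


primary = Member("my-mcp", priority=1)
backup = Member("my-mcp-backup", priority=2)
group = [primary, backup]

print(select_member(group).id)  # my-mcp (normal operation)
primary.healthy = False
print(select_member(group).id)  # my-mcp-backup (failover)
primary.healthy = True
print(select_member(group).id)  # my-mcp (failback)
```

Because selection is recomputed from current health on every call, failback needs no special case: once the primary is healthy again, it is simply the lowest-numbered candidate.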
Failover flow:
- Normal operation: Primary (priority 1) handles all requests. Backup is healthy but idle.
- Primary fails: Health checks detect the failure after 3 consecutive misses (max_consecutive_failures: 3). At health_check_interval_s: 30, that takes up to roughly 90 seconds.
- Failover: Primary removed from rotation. Group selects next lowest priority → backup (priority 2) takes over.
- Recovery: Primary health checks succeed. Primary added back to rotation.
- Failback: Group selects lowest priority again → primary (priority 1) reclaims traffic.
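The detection and recovery steps in this flow come down to counting consecutive failed checks. The sketch below models that counter under two stated assumptions: a single passing check is enough to return a member to rotation, and timeouts are ignored in the timing estimate. `HealthTracker` and `worst_case_detection_s` are hypothetical names, not Hangar APIs.

```python
class HealthTracker:
    """Counts consecutive failed health checks for one member (illustrative only)."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.in_rotation = True

    def record(self, check_passed: bool) -> None:
        if check_passed:
            # Assumption: one passing check restores the member.
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.in_rotation = False


def worst_case_detection_s(interval_s: int, max_failures: int) -> int:
    """Upper bound on detection time, ignoring connect and read timeouts."""
    return interval_s * max_failures


tracker = HealthTracker(max_consecutive_failures=3)
for passed in (False, False, False):
    tracker.record(passed)
print(tracker.in_rotation)            # False: removed after the third miss
tracker.record(True)
print(tracker.in_rotation)            # True: back in rotation
print(worst_case_detection_s(30, 3))  # 90
```

Lowering `health_check_interval_s` or `max_consecutive_failures` shortens detection at the cost of more probe traffic or more sensitivity to transient errors.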
Layer cake architecture:
- Recipe 02 (Health Checks): Per-MCP server health monitoring detects failures
- Recipe 03 (Circuit Breaker): Per-group fast-fail protection
- Recipe 04 (Failover): Inter-MCP server routing changes based on health
Both MCP servers have their own health checks and circuit breakers. The group orchestrates between them. When the primary fails, its circuit may open AND health checks fail AND the group removes it from rotation. Multiple layers of protection working together.
min_healthy: 1 means the group requires at least 1 healthy member to stay operational. If both fail, the group itself becomes unavailable.
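As a small illustration of that threshold, the check amounts to counting healthy members. This is a sketch of the rule as described, with `group_available` as a hypothetical function name.

```python
def group_available(member_health: dict[str, bool], min_healthy: int = 1) -> bool:
    """Return True when at least min_healthy members report healthy."""
    healthy = sum(1 for ok in member_health.values() if ok)
    return healthy >= min_healthy


print(group_available({"my-mcp": True, "my-mcp-backup": True}))    # True
print(group_available({"my-mcp": False, "my-mcp-backup": True}))   # True: backup alone suffices
print(group_available({"my-mcp": False, "my-mcp-backup": False}))  # False: group unavailable
```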
Key Config Reference
| Key | Type | Default | Description |
|---|---|---|---|
| mcp_servers.<name>.strategy | string | none | Routing strategy. Use priority for failover |
| mcp_servers.<name>.members[].id | string | none | MCP Server ID (must exist in mcp_servers: section) |
| mcp_servers.<name>.members[].priority | int | 1 | Routing priority (lower number = higher priority) |
| mcp_servers.<name>.members[].weight | int | 1 | Weight for weighted strategies (not used with priority) |
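The rule that every member ID must exist in the mcp_servers: section is easy to break with a typo. The following sketch checks it on a plain dictionary that mirrors the config from this recipe. `validate_group` is a hypothetical helper, and whether Hangar performs an equivalent check at startup is not covered here.

```python
def validate_group(mcp_servers: dict, group_name: str) -> list[str]:
    """Return a list of problems found; an empty list means the group looks consistent."""
    group = mcp_servers.get(group_name)
    if group is None:
        return [f"{group_name}: not defined under mcp_servers"]
    problems: list[str] = []
    if group.get("mode") != "group":
        problems.append(f"{group_name}: mode must be 'group'")
    for member in group.get("members", []):
        member_id = member.get("id")
        if member_id not in mcp_servers:
            problems.append(f"member {member_id!r}: no such entry under mcp_servers")
        # Priority defaults to 1 when omitted, per the table above.
        if not isinstance(member.get("priority", 1), int):
            problems.append(f"member {member_id!r}: priority must be an int")
    return problems


# Dictionary form of the relevant parts of config.yaml from this recipe.
config = {
    "my-mcp": {"mode": "remote", "endpoint": "http://localhost:8080/mcp"},
    "my-mcp-backup": {"mode": "remote", "endpoint": "http://localhost:8081/mcp"},
    "my-mcp-group": {
        "mode": "group",
        "strategy": "priority",
        "members": [
            {"id": "my-mcp", "priority": 1},
            {"id": "my-mcp-backup", "priority": 2},
        ],
    },
}

print(validate_group(config, "my-mcp-group"))  # []
```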
What's Next
You have failover — one primary, one backup. But what if you have three, five, ten instances of the same MCP server? You don't want priority failover — you want to spread the load evenly across all healthy instances.