02 — Health Checks

Prerequisite: 01 — HTTP Gateway You will need: Working setup from recipe 01, ability to kill the test MCP server Time: 10 minutes Adds: Automatic health monitoring with state transitions on failure

The Problem

Your MCP server from recipe 01 crashes at 3 AM. Hangar doesn't know. It keeps sending requests to a dead endpoint and forwards cryptic connection errors back to Claude. Claude retries. More errors. The on-call engineer gets paged because "AI is broken" — but nobody knows which MCP server is down until someone checks logs. Hangar could have told you in 30 seconds.

The Config

# config.yaml — Recipe 02: Health Checks

mcp_servers:
  my-mcp:
    mode: remote
    endpoint: http://localhost:8080/mcp
    description: "My remote MCP server"
    health_check_interval_s: 30            # NEW: added in this recipe
    max_consecutive_failures: 3            # NEW: added in this recipe
    http:
      connect_timeout: 10.0
      read_timeout: 30.0

Save this as ~/.config/mcp-hangar/config.yaml (or update your existing file).

Try It

Start Hangar with health checks enabled (background process)

mcp-hangar --config ~/.config/mcp-hangar/config.yaml serve \
  --log-file /tmp/hangar.log &

INFO     background_worker_started task=health_check interval_s=60
[1] 12345

Note the process ID. Health check worker starts automatically.

Check initial status

cat > /tmp/check-status.sh << 'EOF'
#!/bin/bash
(
  echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}'
  sleep 0.5
  echo '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}'
  sleep 0.5
  echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"hangar_status","arguments":{}},"id":2}'
  sleep 2
) | nc localhost 8765 2>/dev/null || echo '{"result":"Use stdio mode or check logs"}'
EOF
chmod +x /tmp/check-status.sh

# Alternative: Check logs
tail -5 /tmp/hangar.log

INFO     mcp_server_state mcp_server_id=my-mcp state=cold
INFO     mcp_registry_ready mcp_servers=['my-mcp']

Trigger MCP server start

(
  echo '{"jsonrpc":"2.0","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"id":1}'
  sleep 0.5
  echo '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}'
  sleep 0.5
  echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"hangar_list","arguments":{}},"id":2}'
  sleep 3
  pkill -HUP -f "mcp-hangar"
) | mcp-hangar --config ~/.config/mcp-hangar/config.yaml serve 2>&1 | grep -E "ready|cold"

INFO     mcp_server_state_transition mcp_server_id=my-mcp from=cold to=ready

MCP Server transitioned to READY. Health checks now active.

Simulate MCP server failure

Kill the subprocess MCP server directly (if using subprocess mode) or simulate network failure (if using remote mode):
```
# Find and kill the subprocess mcp_server
ps aux | grep mcp-server-fetch | grep -v grep
kill <PID>
```
Or forcefully fail the MCP server by breaking its process.

Wait for health check cycle (40 seconds)

echo "Waiting for health checks to detect failure..."
sleep 40
tail -10 /tmp/hangar.log | grep -E "health_check|degraded"

WARNING  health_check_failed mcp_server_id=my-mcp error=Connection refused
WARNING  health_check_failed mcp_server_id=my-mcp error=Connection refused
WARNING  health_check_failed mcp_server_id=my-mcp error=Connection refused
WARNING  mcp_server_degraded_by_health_check mcp_server_id=my-mcp

Verify DEGRADED state

# Check the process still running
ps aux | grep "mcp-hangar" | grep -v grep

# Check latest logs
tail -3 /tmp/hangar.log

INFO     mcp_server_state mcp_server_id=my-mcp state=degraded consecutive_failures=3

MCP Server will auto-recover if restarted

For subprocess mode, Hangar will attempt restart on next invocation. For remote mode, restart the remote server manually.
Stop Hangar
```
pkill -f "mcp-hangar.*serve"
```

What Just Happened

Hangar's background health check worker probes each READY MCP server every 60 seconds. The probe mechanism sends a tools/list JSON-RPC request to the MCP server with a 5-second timeout. If the MCP server responds successfully, the health check passes and consecutive_failures resets to 0.

When the MCP server fails to respond, Hangar records a failure in the HealthTracker. After max_consecutive_failures (3 by default) failed checks, the MCP server state transitions from READY to DEGRADED. This transition emits a McpServerDegraded domain event, which updates metrics and triggers alerts.

When the MCP server comes back online, Hangar reinitializes it (DEGRADED -> INITIALIZING -> READY). No manual intervention required -- Hangar detected the failure and recovery without human involvement.

State machine transitions:

READY -> DEGRADED (after 3 consecutive failures)
DEGRADED -> INITIALIZING -> READY (on reinitialize after recovery)

Key Config Reference

Key	Type	Default	Description
`health_check_interval_s`	int	`30`	Per-MCP server health check interval (seconds)
`max_consecutive_failures`	int	`3`	Failures before state transition to DEGRADED

What's Next

Your MCP server is monitored — but when it starts failing intermittently, every request during the 30-second health check interval still hits a broken server. You need automatic protection that trips immediately on the first failure, not after 3 health checks.

→ 03 — Circuit Breaker