13 -- Production Checklist

Before you go live, walk through this list.

Security

TLS termination configured (reverse proxy or load balancer)
auth.enabled: true and auth.allow_anonymous: false
API keys created for each service principal
RBAC roles assigned with least-privilege
Tool access policies set for sensitive tools
Secrets use environment variable interpolation (${VAR}), not plain text in config
Docker MCP servers use read_only: true and network: none where possible

Reliability

Health checks enabled on all MCP servers (health_check_interval_s)
Circuit breaker thresholds tuned (max_consecutive_failures)
MCP Server groups configured for critical MCP servers (at least 2 members)
min_healthy set to match your SLA requirements
Idle TTL set appropriately (300s for subprocess, 600s for containers)
Rate limiting enabled to prevent overload
Event store configured (event_store.driver: sqlite)

Observability

Prometheus scraping /metrics endpoint
Grafana dashboards imported from monitoring/grafana/
Alertmanager rules configured for:
- MCP server state transitions to DEAD
- Circuit breaker OPEN events
- Health check failure rate above threshold
- Tool call error rate above threshold
Structured JSON logging enabled (MCP_JSON_LOGS=true)
Log level set to INFO for production (MCP_LOG_LEVEL=INFO)

Configuration

Config file reviewed for correctness (no validate subcommand exists)
Hot-reload tested via mcp-hangar add API (no SIGHUP handler exists)
Environment-specific configs separated (dev/staging/prod)

Deployment

Running behind a reverse proxy (nginx, Caddy, Envoy)
Health probe endpoints exposed for orchestrator (/health/live, /health/ready, /health/startup)
Graceful shutdown configured (SIGTERM handling)
Resource limits set (memory, CPU) for container deployments
Persistent volume for event store SQLite database
Docker image pinned to specific version tag, not latest

Kubernetes (if applicable)

The MCP-Hangar Operator is an external component shipped from hangar-operator. See Recipe 11 for install instructions.

MCP-Hangar Operator installed (see Recipe 11 prerequisites)
CRDs applied (MCPServer, MCPServerGroup, MCPDiscoverySource)
RBAC (Kubernetes) configured for operator service account
Network policies restricting MCP server-to-MCP server communication
Resource requests and limits in Helm values
PodDisruptionBudget for Hangar deployment

Testing

Failover tested: kill a primary MCP server, verify backup takes over
Cold start tested: invoke a tool on a cold MCP server, verify latency
Rate limit tested: flood API, verify 429 responses
Auth tested: invalid key returns 401, insufficient role returns 403
Config reload tested: edit config.yaml, verify changes apply
Recovery tested: kill all MCP servers, verify they reinitialize

Runbook

Incident response documented
MCP Server restart procedure documented
Config rollback procedure documented
Contact list for MCP server owners maintained