13 -- Production Checklist

Before you go live, walk through this list.

Security

  • TLS termination configured (reverse proxy or load balancer)
  • auth.enabled: true and auth.allow_anonymous: false
  • API keys created for each service principal
  • RBAC roles assigned following least privilege
  • Tool access policies set for sensitive tools
  • Secrets use environment variable interpolation (${VAR}), not plain text in config
  • Docker MCP servers use read_only: true and network: none where possible
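
The plain-text secrets item can be checked mechanically. Below is a minimal sketch, not part of MCP-Hangar, that scans config text line by line and flags secret-looking keys whose value is not a single ${VAR} reference. The key-name pattern, the find_plaintext_secrets helper, and the servers/env layout in the sample are all assumptions made for illustration; adapt them to your actual config.

```python
import re

# Key names that usually hold secrets. This list is an assumption; extend it as needed.
SECRET_KEY_PATTERN = re.compile(r"(password|secret|token|api_key)", re.IGNORECASE)
# A value counts as interpolated only if it is exactly one ${VAR} reference.
INTERPOLATION_PATTERN = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")


def find_plaintext_secrets(config_text):
    """Return (line_number, key) pairs for secret-looking keys with literal values."""
    findings = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.split("#", 1)[0].strip()  # naive comment removal
        if ":" not in stripped:
            continue
        key, value = stripped.split(":", 1)
        key = key.strip().lstrip("- ").strip()
        value = value.strip().strip("'\"")
        if not value:
            continue  # nested mapping or empty value, nothing to inspect
        if SECRET_KEY_PATTERN.search(key) and not INTERPOLATION_PATTERN.match(value):
            findings.append((line_number, key))
    return findings


# Hypothetical config excerpt; only the auth keys and ${VAR} syntax come from this checklist.
sample = """\
auth:
  enabled: true
  allow_anonymous: false
servers:
  github:
    env:
      GITHUB_TOKEN: ${GITHUB_TOKEN}
      DB_PASSWORD: hunter2
"""

print(find_plaintext_secrets(sample))  # [(8, 'DB_PASSWORD')]
```

This is a line-based heuristic rather than a YAML parser, so it will miss multi-line values and will misread values containing "#". Treat a clean result as a hint, not a guarantee.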

Reliability

  • Health checks enabled on all MCP servers (health_check_interval_s)
  • Circuit breaker thresholds tuned (max_consecutive_failures)
  • MCP server groups configured for critical MCP servers (at least 2 members each)
  • min_healthy set to match your SLA requirements
  • Idle TTL set appropriately (300s for subprocesses, 600s for containers)
  • Rate limiting enabled to prevent overload
  • Event store configured (event_store.driver: sqlite)
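
The group-related items (at least 2 members, a sensible min_healthy) lend themselves to a quick sanity check before deployment. The sketch below assumes the group settings have already been loaded into a dictionary of the form {group_name: {"members": [...], "min_healthy": N}}. That shape, the "members" key, and the default of 1 for a missing min_healthy are assumptions for illustration, not the documented MCP-Hangar schema.

```python
def check_groups(groups):
    """Return a list of human-readable problems found in the group definitions."""
    problems = []
    for name, group in groups.items():
        members = group.get("members", [])
        # Assumed default: treat a missing min_healthy as 1.
        min_healthy = group.get("min_healthy", 1)
        if len(members) < 2:
            problems.append(f"{name}: only {len(members)} member(s), need at least 2")
        if min_healthy > len(members):
            problems.append(
                f"{name}: min_healthy={min_healthy} exceeds member count {len(members)}"
            )
    return problems


# Hypothetical group definitions.
example_groups = {
    "search": {"members": ["search-a", "search-b"], "min_healthy": 1},
    "billing": {"members": ["billing-a"], "min_healthy": 2},
}

for problem in check_groups(example_groups):
    print(problem)
```

Whether min_healthy actually matches your SLA is a judgment call that a script cannot make; this only catches values that can never be satisfied.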

Observability

  • Prometheus scraping the /metrics endpoint
  • Grafana dashboards imported from monitoring/grafana/
  • Alertmanager rules configured for:
    • MCP server state transitions to DEAD
    • Circuit breaker OPEN events
    • Health check failure rate above threshold
    • Tool call error rate above threshold
  • Structured JSON logging enabled (MCP_JSON_LOGS=true)
  • Log level set to INFO for production (MCP_LOG_LEVEL=INFO)

Configuration

  • Config file reviewed manually for correctness (there is no validate subcommand)
  • Hot-reload tested via the mcp-hangar add API (there is no SIGHUP handler)
  • Environment-specific configs separated (dev/staging/prod)

Deployment

  • Running behind a reverse proxy (nginx, Caddy, Envoy)
  • Health probe endpoints exposed for orchestrator (/health/live, /health/ready, /health/startup)
  • Graceful shutdown configured (SIGTERM handling)
  • Resource limits set (memory, CPU) for container deployments
  • Persistent volume for event store SQLite database
  • Docker image pinned to specific version tag, not latest
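
To confirm that the three health probe endpoints listed above are reachable through your proxy, a short script using only the Python standard library is enough. This is a sketch under assumptions: it treats HTTP 200 as "healthy", which you should verify against your deployment, and the helper names http_status and check_probes are made up for this example.

```python
import urllib.error
import urllib.request

# Probe paths taken from the checklist above.
PROBE_PATHS = ("/health/live", "/health/ready", "/health/startup")


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def check_probes(base_url, fetch=http_status):
    """Map each probe path to True if it answered 200 (assumed to mean healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in PROBE_PATHS}
```

Call check_probes with the externally visible base URL of your deployment and inspect the returned dictionary. The fetch parameter exists so the check can be exercised without a live server.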

Kubernetes (if applicable)

The MCP-Hangar Operator is an external component shipped from hangar-operator. See Recipe 11 for install instructions.

  • MCP-Hangar Operator installed (see Recipe 11 prerequisites)
  • CRDs applied (MCPServer, MCPServerGroup, MCPDiscoverySource)
  • Kubernetes RBAC configured for the operator service account
  • Network policies restricting MCP server-to-MCP server communication
  • Resource requests and limits in Helm values
  • PodDisruptionBudget for Hangar deployment

Testing

  • Failover tested: kill a primary MCP server, verify backup takes over
  • Cold start tested: invoke a tool on a cold MCP server, verify latency
  • Rate limit tested: flood API, verify 429 responses
  • Auth tested: invalid key returns 401, insufficient role returns 403
  • Config reload tested: edit config.yaml, verify changes apply
  • Recovery tested: kill all MCP servers, verify they reinitialize
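
The auth and rate limit items can be scripted as a small smoke test. The sketch below is illustrative only: the X-API-Key header name is an assumption (use whatever header your deployment expects), the target URL is yours to supply, and the helper names status_for and run_smoke_checks are invented for this example.

```python
import urllib.error
import urllib.request


def status_for(url, api_key, timeout=2.0):
    """Send one GET request with the given key and return the HTTP status code."""
    # Header name is an assumption; replace it with the one your deployment uses.
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 401, 403, 429 and other error statuses land here


def run_smoke_checks(url, invalid_key, low_privilege_key, valid_key,
                     flood_size=200, fetch=status_for):
    """Check the 401, 403, and 429 expectations from the checklist."""
    results = {
        "invalid key returns 401": fetch(url, invalid_key) == 401,
        "insufficient role returns 403": fetch(url, low_privilege_key) == 403,
    }
    # Sequential flood: send many requests and look for at least one 429.
    flood_statuses = [fetch(url, valid_key) for _ in range(flood_size)]
    results["flood returns 429"] = 429 in flood_statuses
    return results
```

The flood here is sequential, so it may be too slow to trip a generous rate limit. If the 429 check fails, raise flood_size or send requests concurrently before concluding that rate limiting is off.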

Runbook

  • Incident response documented
  • MCP server restart procedure documented
  • Config rollback procedure documented
  • Contact list for MCP server owners maintained