13 -- Production Checklist
Before you go live, walk through this list.
Security
- TLS termination configured (reverse proxy or load balancer)
-
auth.enabled: trueandauth.allow_anonymous: false - API keys created for each service principal
- RBAC roles assigned with least-privilege
- Tool access policies set for sensitive tools
- Secrets use environment variable interpolation (
${VAR}), not plain text in config - Docker MCP servers use
read_only: trueandnetwork: nonewhere possible
Reliability
- Health checks enabled on all MCP servers (
health_check_interval_s) - Circuit breaker thresholds tuned (
max_consecutive_failures) - MCP Server groups configured for critical MCP servers (at least 2 members)
-
min_healthyset to match your SLA requirements - Idle TTL set appropriately (300s for subprocess, 600s for containers)
- Rate limiting enabled to prevent overload
- Event store configured (
event_store.driver: sqlite)
Observability
- Prometheus scraping
/metricsendpoint - Grafana dashboards imported from
monitoring/grafana/ - Alertmanager rules configured for:
- MCP server state transitions to DEAD
- Circuit breaker OPEN events
- Health check failure rate above threshold
- Tool call error rate above threshold
- Structured JSON logging enabled (
MCP_JSON_LOGS=true) - Log level set to
INFOfor production (MCP_LOG_LEVEL=INFO)
Configuration
- Config file reviewed for correctness (no
validatesubcommand exists) - Hot-reload tested via
mcp-hangar addAPI (no SIGHUP handler exists) - Environment-specific configs separated (dev/staging/prod)
Deployment
- Running behind a reverse proxy (nginx, Caddy, Envoy)
- Health probe endpoints exposed for orchestrator (
/health/live,/health/ready,/health/startup) - Graceful shutdown configured (SIGTERM handling)
- Resource limits set (memory, CPU) for container deployments
- Persistent volume for event store SQLite database
- Docker image pinned to specific version tag, not
latest
Kubernetes (if applicable)
The MCP-Hangar Operator is an external component shipped from hangar-operator. See Recipe 11 for install instructions.
- MCP-Hangar Operator installed (see Recipe 11 prerequisites)
- CRDs applied (
MCPServer,MCPServerGroup,MCPDiscoverySource) - RBAC (Kubernetes) configured for operator service account
- Network policies restricting MCP server-to-MCP server communication
- Resource requests and limits in Helm values
- PodDisruptionBudget for Hangar deployment
Testing
- Failover tested: kill a primary MCP server, verify backup takes over
- Cold start tested: invoke a tool on a cold MCP server, verify latency
- Rate limit tested: flood API, verify 429 responses
- Auth tested: invalid key returns 401, insufficient role returns 403
- Config reload tested: edit config.yaml, verify changes apply
- Recovery tested: kill all MCP servers, verify they reinitialize
Runbook
- Incident response documented
- MCP Server restart procedure documented
- Config rollback procedure documented
- Contact list for MCP server owners maintained