13 -- Production Checklist
Before you go live, walk through this list.
Security
- [ ] TLS termination configured (reverse proxy or load balancer)
- [ ]
auth.enabled: trueandauth.allow_anonymous: false - [ ] API keys created for each service principal
- [ ] RBAC roles assigned with least-privilege
- [ ] Tool access policies set for sensitive tools
- [ ]
MCP_AUTH_SECRETset to a stable, random value (not auto-generated) - [ ] Secrets use environment variable interpolation (
${VAR}), not plain text in config - [ ] Docker providers use
read_only: trueandnetwork: nonewhere possible
Reliability
- [ ] Health checks enabled on all providers (
health_check_interval_s) - [ ] Circuit breaker thresholds tuned (
max_consecutive_failures) - [ ] Provider groups configured for critical providers (at least 2 members)
- [ ]
min_healthyset to match your SLA requirements - [ ] Idle TTL set appropriately (300s for subprocess, 600s for containers)
- [ ] Rate limiting enabled to prevent overload
- [ ] Event store configured (
event_store.driver: sqlite)
Observability
- [ ] Prometheus scraping
/metricsendpoint - [ ] Grafana dashboards imported from
monitoring/grafana/ - [ ] Alertmanager rules configured for:
- Provider state transitions to DEAD
- Circuit breaker OPEN events
- Health check failure rate above threshold
- Tool call error rate above threshold
- [ ] Structured JSON logging enabled (
MCP_JSON_LOGS=true) - [ ] Log level set to
INFOfor production (MCP_LOG_LEVEL=INFO)
Configuration
- [ ] Config file validated (
mcp-hangar validate config.yaml) - [ ] Hot-reload tested (
kill -HUPor file edit) - [ ] Config backup strategy in place (use
/api/config/backup) - [ ] Environment-specific configs separated (dev/staging/prod)
Deployment
- [ ] Running behind a reverse proxy (nginx, Caddy, Envoy)
- [ ] Health check endpoint exposed for orchestrator (
/api/system) - [ ] Graceful shutdown configured (SIGTERM handling)
- [ ] Resource limits set (memory, CPU) for container deployments
- [ ] Persistent volume for event store SQLite database
- [ ] Docker image pinned to specific version tag, not
latest
Kubernetes (if applicable)
- [ ] MCP-Hangar Operator installed
- [ ] CRDs applied (
MCPProvider,MCPProviderGroup,MCPDiscoverySource) - [ ] RBAC (Kubernetes) configured for operator service account
- [ ] Network policies restricting provider-to-provider communication
- [ ] Resource requests and limits in Helm values
- [ ] PodDisruptionBudget for Hangar deployment
Testing
- [ ] Failover tested: kill a primary provider, verify backup takes over
- [ ] Cold start tested: invoke a tool on a cold provider, verify latency
- [ ] Rate limit tested: flood API, verify 429 responses
- [ ] Auth tested: invalid key returns 401, insufficient role returns 403
- [ ] Config reload tested: edit config.yaml, verify changes apply
- [ ] Recovery tested: kill all providers, verify they reinitialize
Runbook
- [ ] Incident response documented
- [ ] Provider restart procedure documented
- [ ] Config rollback procedure documented
- [ ] Contact list for provider owners maintained