
Observability Guide

This guide covers MCP Hangar's observability features: metrics, tracing, logging, and health checks.

Quick Start

Prerequisites

```bash
# Core package
pip install mcp-hangar

# For full observability support (quoted so the extras survive shells like zsh)
pip install "mcp-hangar[observability]"
```

Start Monitoring Stack

The monitoring stack is in monitoring/ and includes Prometheus, Grafana, and Alertmanager:

```bash
# Using Docker Compose
cd monitoring
docker compose up -d

# Using Podman
cd monitoring
podman compose up -d
```

Access dashboards:

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |

Start MCP Hangar with Metrics

```bash
# HTTP mode (exposes /metrics endpoint)
mcp-hangar serve --http --port 8000

# With custom config
MCP_CONFIG=config.yaml mcp-hangar serve --http --port 8000
```

Verify metrics are exposed:

```bash
curl http://localhost:8000/metrics | grep mcp_hangar
```

Monitoring Stack

Architecture

```
+----------------+     scrape      +------------+
|  MCP Hangar    |---------------->| Prometheus |
|  :8000/metrics |                 |   :9090    |
+----------------+                 +-----+------+
                                         |
                                         | query
                                         v
                                   +------------+
                                   |  Grafana   |
                                   |   :3000    |
                                   +------------+

+----------------+     alerts      +-------------+
|  Prometheus    |---------------->| Alertmanager|
|  alert rules   |                 |    :9093    |
+----------------+                 +-------------+
```

Configuration Files

| File | Purpose |
|---|---|
| monitoring/docker-compose.yaml | Container orchestration |
| monitoring/prometheus/prometheus.yaml | Scrape configuration |
| monitoring/prometheus/alerts.yaml | Alert rules |
| monitoring/alertmanager/alertmanager.yaml | Notification routing |
| monitoring/grafana/provisioning/ | Dashboard/datasource provisioning |
| monitoring/grafana/dashboards/ | Pre-built dashboard JSON files |

Prometheus Configuration

The default configuration scrapes MCP Hangar every 10 seconds:

```yaml
# monitoring/prometheus/prometheus.yaml
scrape_configs:
  - job_name: 'mcp-hangar'
    static_configs:
      - targets: ['host.docker.internal:8000']
        labels:
          service: 'mcp-hangar'
          tier: 'application'
    metrics_path: /metrics
    scrape_interval: 10s
    scrape_timeout: 5s
```

For Kubernetes deployments, use service discovery:

```yaml
scrape_configs:
  - job_name: 'mcp-hangar'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: mcp-hangar
        action: keep
```

Metrics

MCP Hangar exports Prometheus metrics at /metrics. All metrics use the mcp_hangar_ prefix.
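
To confirm which metric families your running build actually exports, you can list the distinct names (a quick sketch; assumes the server is on the default port from the Quick Start):

```bash
# List the unique mcp_hangar_* metric names currently exported
curl -s http://localhost:8000/metrics | grep -oE '^mcp_hangar_[a-z_]+' | sort -u
```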

Currently Exported Metrics

Tool Invocations

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_tool_calls_total | Counter | provider, tool, status | Total tool invocations |
| mcp_hangar_tool_call_duration_seconds | Histogram | provider, tool | Invocation latency (buckets: 0.01-30s) |
| mcp_hangar_tool_call_errors_total | Counter | provider, tool, error_type | Failed invocations by error type |

Example queries:

```promql
# Tool call rate by provider
sum(rate(mcp_hangar_tool_calls_total[5m])) by (provider)

# P95 latency by tool
histogram_quantile(0.95, sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le, tool))

# Error rate
sum(rate(mcp_hangar_tool_call_errors_total[5m])) / sum(rate(mcp_hangar_tool_calls_total[5m]))
```

Batch Invocations

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_batch_calls_total | Counter | result | Batch invocations (success/failure) |
| mcp_hangar_batch_duration_seconds | Histogram | - | Batch execution time |
| mcp_hangar_batch_size | Histogram | - | Number of calls per batch |
| mcp_hangar_batch_cancellations_total | Counter | - | Cancelled batches |
| mcp_hangar_batch_circuit_breaker_rejections_total | Counter | - | Circuit breaker rejections |
| mcp_hangar_batch_concurrency | Gauge | - | Current parallel executions |

Example queries:

```promql
# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))

# Average batch size
rate(mcp_hangar_batch_size_sum[5m]) / rate(mcp_hangar_batch_size_count[5m])
```

Health Checks

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_health_checks_total | Counter | provider, result | Health check executions |
| mcp_hangar_health_check_duration_seconds | Histogram | provider | Health check latency |
| mcp_hangar_health_check_consecutive_failures | Gauge | provider | Current consecutive failure count |

Example queries:

```promql
# Unhealthy providers (>2 consecutive failures)
mcp_hangar_health_check_consecutive_failures > 2

# Health check success rate
sum(rate(mcp_hangar_health_checks_total{result="healthy"}[5m])) by (provider)
/ sum(rate(mcp_hangar_health_checks_total[5m])) by (provider)
```

Provider Lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_provider_starts_total | Counter | provider | Provider start attempts |
| mcp_hangar_provider_initialized | Gauge | provider | 1 if provider has been initialized |
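
Example queries (illustrative, written against the metrics above rather than taken from the shipped dashboards):

```promql
# Provider start rate (a sustained rate may indicate a crash loop)
sum(rate(mcp_hangar_provider_starts_total[15m])) by (provider)

# Providers that have never completed initialization
mcp_hangar_provider_initialized == 0
```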

GC (Garbage Collection)

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_gc_cycles_total | Counter | - | GC cycle executions |
| mcp_hangar_gc_cycle_duration_seconds | Histogram | - | GC cycle duration |
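
Example queries (again illustrative):

```promql
# GC cycle rate
rate(mcp_hangar_gc_cycles_total[5m])

# P95 GC cycle duration
histogram_quantile(0.95, sum(rate(mcp_hangar_gc_cycle_duration_seconds_bucket[5m])) by (le))
```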

Metrics Not Yet Implemented

The following metrics are defined in code but not currently populated. They are planned for future releases:

  • mcp_hangar_provider_state - Provider state gauge (cold/ready/degraded/dead)
  • mcp_hangar_provider_up - Provider availability
  • mcp_hangar_provider_cold_start_seconds - Cold start latency histogram
  • mcp_hangar_discovery_* - Auto-discovery metrics
  • mcp_hangar_http_* - HTTP transport metrics (for remote providers)
  • mcp_hangar_rate_limit_hits_total - Rate limiting metrics
  • mcp_hangar_connections_* - Connection tracking

Grafana Dashboards

Pre-built dashboards are provisioned automatically from monitoring/grafana/dashboards/:

Overview Dashboard

File: overview.json
URL: http://localhost:3000/d/mcp-hangar-overview

Provides high-level system health:

  • Request rate and error rate trends
  • Latency percentiles (P50, P95, P99)
  • Provider health status
  • Batch invocation success/failure rates
  • Health check results
  • GC cycle performance

Provider Details Dashboard

File: provider-details.json
URL: http://localhost:3000/d/mcp-hangar-provider-details

Deep dive into individual providers:

  • Tool call breakdown by tool name
  • Per-tool latency histograms
  • Error distribution by type
  • Health check history
  • Consecutive failure tracking

Alerts Dashboard

File: alerts.json
URL: http://localhost:3000/d/mcp-hangar-alerts

Alert monitoring and trends:

  • Active alerts by severity
  • Alert condition trends (error rate, latency, health)
  • Historical alert timeline

Importing Dashboards Manually

If not using provisioning:

  1. Open Grafana at http://localhost:3000
  2. Go to Dashboards > Import
  3. Upload JSON file from monitoring/grafana/dashboards/
  4. Select Prometheus data source
  5. Click Import
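
Alternatively, dashboards can be pushed through Grafana's HTTP API. The following is a sketch assuming the default admin/admin credentials; note that the import payload wraps the dashboard JSON in an envelope, and any top-level "id" field in the exported JSON may need to be removed first:

```bash
# Push a dashboard via the Grafana API (default credentials)
curl -s -u admin:admin -H "Content-Type: application/json" \
  -X POST http://localhost:3000/api/dashboards/db \
  -d "{\"dashboard\": $(cat monitoring/grafana/dashboards/overview.json), \"overwrite\": true}"
```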

Alerting

Alert Configuration

Alert rules are defined in monitoring/prometheus/alerts.yaml and organized by severity:

Critical Alerts (Page On-Call)

| Alert | Condition | For | Description |
|---|---|---|---|
| MCPHangarNotResponding | up{job="mcp-hangar"} == 0 | 1m | Service unreachable |
| MCPHangarHighErrorRate | Error rate > 10% | 2m | Significant failures |
| MCPHangarBatchHighFailureRate | Batch failure > 20% | 3m | Batch operations failing |
| MCPHangarCircuitBreakerTripped | CB rejections > 10/5m | 2m | Provider isolated |
| MCPHangarProviderUnhealthy | Consecutive failures > 5 | 2m | Provider critically unhealthy |
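
For reference, a rule like MCPHangarHighErrorRate is expressed in standard Prometheus rule syntax. The following sketch shows the shape; the exact expression in monitoring/prometheus/alerts.yaml may differ:

```yaml
groups:
  - name: mcp-hangar-critical
    rules:
      - alert: MCPHangarHighErrorRate
        expr: |
          sum(rate(mcp_hangar_tool_call_errors_total[5m]))
            / sum(rate(mcp_hangar_tool_calls_total[5m])) > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCP Hangar error rate above 10%"
```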

Warning Alerts (Investigate)

| Alert | Condition | For | Description |
|---|---|---|---|
| MCPHangarHighConsecutiveFailures | Consecutive failures > 2 | 2m | Health check issues |
| MCPHangarHealthCheckSlow | P95 health check > 5s | 5m | Slow health checks |
| MCPHangarHighLatencyP95 | P95 latency > 3s | 5m | Performance degradation |
| MCPHangarHighLatencyP99 | P99 latency > 5s | 5m | Tail latency issues |
| MCPHangarHighLatencyByTool | P95 per-tool > 5s | 5m | Specific tool slow |
| MCPHangarFrequentColdStarts | Start rate > 0.1/s | 10m | Consider increasing idle_ttl |
| MCPHangarBatchSlowExecution | P95 batch > 30s | 5m | Slow batch processing |
| MCPHangarBatchHighCancellationRate | Cancellation > 10% | 5m | Batches timing out |
| MCPHangarBatchSizeTooLarge | P95 size > 50 | 5m | Consider smaller batches |
| MCPHangarGCSlowCycles | P95 GC > 0.5s | 5m | GC performance issue |
| MCPHangarHighMemoryUsage | Memory > 2GB | 10m | Memory pressure |
| MCPHangarHighCPUUsage | CPU > 80% | 10m | CPU saturation |

Info Alerts (Tracking)

| Alert | Condition | Description |
|---|---|---|
| MCPHangarProviderStarted | Any provider start | Provider lifecycle event |
| MCPHangarHighToolCallVolume | Rate > 100/s | High traffic notification |

Alertmanager Configuration

Configure notification routing in monitoring/alertmanager/alertmanager.yaml:

```yaml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-service-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'
        channel: '#mcp-hangar-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```

Testing Alerts

Verify alert rules are loaded:

```bash
# Check Prometheus rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
```
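
You can also fire a synthetic alert directly at Alertmanager to verify notification routing end to end. This sketch uses Alertmanager's v2 API; the alert name and labels are arbitrary test values:

```bash
# Send a test alert to Alertmanager (it resolves itself after resolve_timeout)
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "MCPHangarTestAlert", "severity": "warning"},
        "annotations": {"summary": "Test alert for routing verification"}}]'
```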

Tracing

OpenTelemetry Integration

MCP Hangar supports distributed tracing via OpenTelemetry:

```python
from mcp_hangar.observability import init_tracing, trace_span

# Initialize once at startup
init_tracing(
    service_name="mcp-hangar",
    otlp_endpoint="http://localhost:4317",
)

# Create spans for operations
with trace_span("process_request", {"request.id": req_id}) as span:
    span.add_event("checkpoint_reached")
    result = do_work()
```

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MCP_TRACING_ENABLED | true | Enable/disable tracing |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint |
| OTEL_SERVICE_NAME | mcp-hangar | Service name in traces |
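
For example, to point traces at a non-default collector (the endpoint and service name here are illustrative):

```bash
# Export traces to a collector on another host; set MCP_TRACING_ENABLED=false to disable
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_SERVICE_NAME=mcp-hangar-staging \
mcp-hangar serve --http --port 8000
```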

Trace Context Propagation

```python
from mcp_hangar.observability import inject_trace_context, extract_trace_context

# Inject into outgoing requests
headers = {}
inject_trace_context(headers)

# Extract from incoming requests
context = extract_trace_context(request_headers)
```

Langfuse Integration

MCP Hangar integrates with Langfuse for LLM-specific observability.

Configuration

```bash
export MCP_LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com
```

Or via config.yaml:

```yaml
observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
    sample_rate: 1.0
```

Trace Propagation

```python
from mcp_hangar.application.services import TracedProviderService

# traced_service: an already-constructed TracedProviderService instance
result = traced_service.invoke_tool(
    provider_id="math",
    tool_name="add",
    arguments={"a": 1, "b": 2},
    trace_id="your-langfuse-trace-id",
    user_id="user-123",
    session_id="session-456",
)
```

See ADR-001 for architectural details.

Logging

Structured Logging

MCP Hangar uses structlog for structured JSON logging:

```json
{
  "timestamp": "2026-02-03T10:30:00.123Z",
  "level": "info",
  "event": "tool_invoked",
  "provider": "math",
  "tool": "add",
  "duration_ms": 150,
  "service": "mcp-hangar"
}
```

Configuration

```yaml
logging:
  level: INFO          # DEBUG, INFO, WARNING, ERROR
  json_format: true    # JSON output for log aggregation
```

Environment variable:

```bash
MCP_LOG_LEVEL=DEBUG mcp-hangar serve --http
```

Log Correlation

Include trace IDs for correlation with distributed traces:

```python
from mcp_hangar.observability import get_current_trace_id
from mcp_hangar.logging_config import get_logger

logger = get_logger(__name__)
logger.info("processing", trace_id=get_current_trace_id())
```

Health Checks

HTTP Endpoints

| Endpoint | Purpose | Use Case |
|---|---|---|
| /health/live | Liveness | Container restart decisions |
| /health/ready | Readiness | Traffic routing |
| /health/startup | Startup | Initial boot gate |
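
These are plain HTTP GETs, so they are easy to probe by hand (assuming the default port from the Quick Start):

```bash
# Liveness should respond immediately; readiness flips once providers are initialized
curl -s http://localhost:8000/health/live
curl -s http://localhost:8000/health/ready | jq '.status'
```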

Response Format

```json
{
  "status": "healthy",
  "checks": [
    {
      "name": "providers",
      "status": "healthy",
      "duration_ms": 1.2
    }
  ],
  "version": "0.6.3",
  "uptime_seconds": 3600.5
}
```

Kubernetes Configuration

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```
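
The /health/startup endpoint pairs naturally with a Kubernetes startup probe; a sketch, with illustrative thresholds:

```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8000
  failureThreshold: 30   # allow up to 30 x 2s = 60s for initial boot
  periodSeconds: 2
```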

SLIs/SLOs

Service Level Indicators

| SLI | Metric | Measurement |
|---|---|---|
| Availability | Service up | up{job="mcp-hangar"} |
| Latency | Tool call duration | P95 < 3s |
| Error Rate | Failed invocations | Error rate < 1% |
| Batch Success | Batch completion | Success rate > 95% |

Service Level Objectives

| SLI | Target | Window |
|---|---|---|
| Availability | 99.9% | 30 days |
| Latency (P95) | < 3s | 5 minutes |
| Error Rate | < 1% | 5 minutes |
| Batch Success | > 95% | 5 minutes |

PromQL Queries

```promql
# Availability (service up ratio over 30d)
avg_over_time(up{job="mcp-hangar"}[30d])

# Error budget remaining
1 - (
  sum(increase(mcp_hangar_tool_call_errors_total[30d]))
  / sum(increase(mcp_hangar_tool_calls_total[30d]))
) / 0.01

# P95 latency
histogram_quantile(0.95,
  sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le)
)

# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))
```

Troubleshooting

Metrics Not Visible

  1. Verify the endpoint:

     ```bash
     curl http://localhost:8000/metrics | head -20
     ```

  2. Check Prometheus targets at http://localhost:9090/targets

  3. Verify network connectivity (use host.docker.internal for Docker on Mac/Windows)

Alerts Not Firing

  1. Check that alert rules are loaded:

     ```bash
     curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
     ```

  2. Verify the metrics referenced by the alert expressions exist

  3. Check Alertmanager connectivity:

     ```bash
     curl http://localhost:9093/api/v2/status
     ```

High Consecutive Failures

If MCPHangarHighConsecutiveFailures fires:

  1. Check provider logs for errors

  2. Verify provider command/configuration

  3. Test the provider manually:

     ```bash
     mcp-hangar provider start <provider-id>
     ```

Provider Start Errors

Common patterns and fixes:

| Error | Cause | Fix |
|---|---|---|
| ModuleNotFoundError | Missing dependency | pip install <package> |
| FileNotFoundError | Wrong path | Check command in config |
| PermissionError | Not executable | chmod +x <script> |
| Exit code 137 | OOM killed | Increase memory limits |
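
For exit code 137 specifically, you can confirm an OOM kill if the provider runs in a container (a sketch; <container> stands for whatever runs the provider):

```bash
# True if the container was terminated by the OOM killer
docker inspect --format '{{.State.OOMKilled}}' <container>

# On the host, the kernel log also records OOM kills
dmesg | grep -i 'killed process'
```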

Best Practices

Metrics

  1. Monitor the right things - Focus on user-facing SLIs
  2. Set appropriate retention - 15 days for metrics, 7 days for traces
  3. Avoid high cardinality - Don't use unbounded values (request IDs, session IDs, timestamps) as labels; see the sketch below
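
To make the cardinality point concrete, here is a small Python sketch using prometheus_client (not MCP Hangar's internal code) contrasting a bounded label set with an unbounded one:

```python
from prometheus_client import Counter

# Good: "status" takes a handful of values, so the series count stays bounded
calls = Counter("tool_calls_total", "Tool invocations", ["provider", "status"])
calls.labels(provider="math", status="success").inc()

# Bad: one time series per request ID, so cardinality grows without bound
# calls_bad = Counter("tool_calls_bad_total", "Tool invocations", ["request_id"])
# calls_bad.labels(request_id=req_id).inc()  # avoid this
```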

Alerting

  1. Create runbooks - Document response procedures
  2. Start conservative - Tune thresholds based on baseline
  3. Test regularly - Verify notification channels work
  4. Use severity correctly - Critical = page, Warning = ticket

Dashboards

  1. Layer information - Overview -> Details -> Debug
  2. Include time selectors - Allow drilling into incidents
  3. Add annotations - Mark deployments and incidents

Production Readiness Checklist

  • [ ] Prometheus scraping MCP Hangar metrics
  • [ ] Grafana dashboards imported and working
  • [ ] Alertmanager configured with notification routes
  • [ ] Critical alerts tested (e.g., stop service, verify page)
  • [ ] Runbooks created for each alert
  • [ ] Log aggregation configured (ELK, Loki, etc.)
  • [ ] Tracing enabled and traces visible in Jaeger/Langfuse
