MCP Server Groups

Aggregate multiple MCP servers behind a single virtual MCP server with load balancing, health tracking, and circuit breaker protection.

Overview

MCP Server Groups allow you to treat multiple MCP servers as a single logical unit. MCP clients interact with the group as if it were one MCP server -- the group handles member selection, health monitoring, and failover automatically.

Use groups when you need:

  • High availability -- If one MCP server fails, requests route to healthy members
  • Load distribution -- Spread requests across multiple MCP servers using configurable strategies
  • Failover -- Designate primary and backup MCP servers with priority-based routing
  • Capacity scaling -- Add members to increase throughput without changing client configuration

Group States

| State | Condition | Accepts Requests |
| --- | --- | --- |
| inactive | 0 healthy members | No |
| partial | healthy members < min_healthy | Yes (if circuit closed and healthy >= 1) |
| healthy | healthy members >= min_healthy | Yes |
| degraded | Circuit breaker open | No |
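
The table above can be read as a small decision function. The sketch below is illustrative only: `GroupState`, `resolve_group_state`, and `accepts_requests` are hypothetical names, and the order of the checks (an open circuit wins over member counts) is an assumption, since the table does not say which condition takes precedence.

```python
from enum import Enum


class GroupState(str, Enum):
    INACTIVE = "inactive"
    PARTIAL = "partial"
    HEALTHY = "healthy"
    DEGRADED = "degraded"


def resolve_group_state(healthy_members: int, min_healthy: int, circuit_open: bool) -> GroupState:
    """Map the conditions from the Group States table to a state."""
    # Assumption: an open circuit breaker is checked before member counts.
    if circuit_open:
        return GroupState.DEGRADED
    if healthy_members == 0:
        return GroupState.INACTIVE
    if healthy_members < min_healthy:
        return GroupState.PARTIAL
    return GroupState.HEALTHY


def accepts_requests(state: GroupState) -> bool:
    """Only partial and healthy groups accept requests."""
    return state in (GroupState.PARTIAL, GroupState.HEALTHY)
```

For example, a group with `min_healthy: 2` and one healthy member resolves to `partial` and still accepts requests while the circuit is closed.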

Configuration

Groups are defined in config.yaml alongside regular MCP servers. Set mode: group to create a group.

yaml
mcp_servers:
  llm-pool:
    mode: group
    strategy: round_robin
    min_healthy: 1
    auto_start: true
    description: "LLM MCP server pool with failover"
    members:
      - id: llm-1
        mode: subprocess
        command: [python, -m, llm_server]
      - id: llm-2
        mode: subprocess
        command: [python, -m, llm_server]

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| mode | str | -- | Must be "group" |
| strategy | str | "round_robin" | Load balancing strategy |
| min_healthy | int | 1 | Minimum healthy members for healthy state |
| auto_start | bool | true | Auto-start members when the group is added |
| description | str | -- | Human-readable description |
| members | list[dict] | [] | Member MCP server configurations |

Each member entry accepts the same keys as a regular MCP server (mode, command, image, endpoint, env, idle_ttl_s, etc.) plus group-specific keys:

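| Key | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| id | str | required | -- | Unique member identifier |
| weight | int | 50 | 1-100 | Weight for the weighted_round_robin and random strategies |
| priority | int | 50 | 1-100 | Priority for the priority strategy (lower number = higher priority) |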

For the full YAML schema, see the Configuration Reference.

Load Balancing Strategies

Round Robin

Distributes requests sequentially across all healthy members. Each member receives an equal share of traffic.

yaml
mcp_servers:
  api-pool:
    mode: group
    strategy: round_robin
    min_healthy: 1
    members:
      - id: api-1
        mode: subprocess
        command: [python, -m, api_server]
      - id: api-2
        mode: subprocess
        command: [python, -m, api_server]
      - id: api-3
        mode: subprocess
        command: [python, -m, api_server]

Requests cycle through members in order: api-1, api-2, api-3, api-1, api-2, ... Unhealthy members are skipped. No weight or priority configuration applies.

Choose round robin when all members have similar capacity and you want even distribution.
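
The cycling-and-skipping behavior described above can be sketched as follows. This is a minimal illustration, not the project's implementation; `RoundRobinSelector` and its `select` method are hypothetical names, and health is passed in as a plain set of member ids.

```python
from typing import Optional


class RoundRobinSelector:
    """Cycle through members in configured order, skipping unhealthy ones."""

    def __init__(self, member_ids: list[str]) -> None:
        self._member_ids = list(member_ids)
        self._next_index = 0  # position of the next member to try

    def select(self, healthy: set[str]) -> Optional[str]:
        # Try each member at most once, starting from the cursor.
        for offset in range(len(self._member_ids)):
            index = (self._next_index + offset) % len(self._member_ids)
            candidate = self._member_ids[index]
            if candidate in healthy:
                # Advance the cursor past the member we just picked.
                self._next_index = (index + 1) % len(self._member_ids)
                return candidate
        return None  # no healthy member available
```

With all three members healthy, successive calls return api-1, api-2, api-3, api-1. If api-2 is then marked unhealthy, the next call skips it and returns api-3.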

Weighted Round Robin

Distributes requests proportionally based on member weights using the Nginx smooth weighted round-robin algorithm. Higher weight means more requests.

yaml
mcp_servers:
  compute-pool:
    mode: group
    strategy: weighted_round_robin
    min_healthy: 1
    members:
      - id: large-instance
        mode: remote
        endpoint: https://large.example.com/mcp
        weight: 80
      - id: small-instance
        mode: remote
        endpoint: https://small.example.com/mcp
        weight: 20

With weights 80 and 20, large-instance receives approximately 4 out of every 5 requests. The smooth weighted algorithm spreads each member's turns across the cycle instead of serving a member's whole share in one burst; the effect is most visible with three or more members.

Choose weighted round robin when members have different capacities (e.g., different hardware, instance sizes).
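
For readers unfamiliar with smooth weighted round robin, here is a minimal sketch of the general technique: each pick adds every candidate's weight to its running counter, selects the largest counter, and subtracts the total weight from the winner. `SmoothWeightedSelector` is a hypothetical name, and restricting the calculation to healthy members is an assumption about how unhealthy members are handled.

```python
from typing import Optional


class SmoothWeightedSelector:
    """Smooth weighted round robin over a fixed set of weighted members."""

    def __init__(self, weights: dict[str, int]) -> None:
        self._weights = dict(weights)
        self._current = {member_id: 0 for member_id in weights}

    def select(self, healthy: set[str]) -> Optional[str]:
        # Assumption: unhealthy members are excluded from the calculation.
        candidates = [m for m in self._weights if m in healthy]
        if not candidates:
            return None
        total = sum(self._weights[m] for m in candidates)
        # Step 1: every candidate gains its own weight.
        for m in candidates:
            self._current[m] += self._weights[m]
        # Step 2: pick the largest counter (first one wins on a tie).
        chosen = max(candidates, key=lambda m: self._current[m])
        # Step 3: the winner pays back the total weight.
        self._current[chosen] -= total
        return chosen
```

With weights 80 and 20, the first five picks are large-instance, large-instance, small-instance, large-instance, large-instance, after which the counters return to zero and the pattern repeats.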

Least Connections

Selects the member with the oldest last_selected_at timestamp, effectively routing to the least recently used member. This approximates least-connections behavior by distributing requests to the member that has been idle the longest.

yaml
mcp_servers:
  db-pool:
    mode: group
    strategy: least_connections
    min_healthy: 2
    members:
      - id: db-reader-1
        mode: remote
        endpoint: https://db1.example.com/mcp
      - id: db-reader-2
        mode: remote
        endpoint: https://db2.example.com/mcp
      - id: db-reader-3
        mode: remote
        endpoint: https://db3.example.com/mcp

No weight or priority configuration applies. When multiple members have the same timestamp, the first healthy member is selected.

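Choose least connections when requests have variable duration and you want to reduce the chance of routing to a member that is still processing a long request. Because selection is based on last_selected_at rather than a count of in-flight requests, this is an approximation.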
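
The selection rule described above (oldest last_selected_at wins, first healthy member on a tie) can be sketched as a single function. `select_least_recently_used` is a hypothetical name, and storing timestamps as floats in a dict keyed by member id is an assumption made for the example.

```python
from typing import Optional


def select_least_recently_used(
    last_selected_at: dict[str, float], healthy: set[str]
) -> Optional[str]:
    """Return the healthy member with the oldest last_selected_at timestamp."""
    # Keep configured order so ties resolve to the first healthy member.
    candidates = [m for m in last_selected_at if m in healthy]
    if not candidates:
        return None
    # min() returns the first minimum it encounters.
    return min(candidates, key=lambda m: last_selected_at[m])
```

A caller would update `last_selected_at[member_id]` with the current time after each selection so that the chosen member moves to the back of the queue.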

Random

Selects a random healthy member using weighted probability. Members with higher weight have a proportionally higher chance of being selected.

yaml
mcp_servers:
  search-pool:
    mode: group
    strategy: random
    min_healthy: 1
    members:
      - id: search-primary
        mode: subprocess
        command: [python, -m, search_server]
        weight: 70
      - id: search-secondary
        mode: subprocess
        command: [python, -m, search_server]
        weight: 30

With weights 70 and 30, search-primary has a 70% probability of being selected per request. Unlike round robin, there is no guaranteed ordering -- consecutive requests may go to the same member.

Choose random when you want simple probabilistic distribution without the overhead of tracking request order.
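
Weighted random selection maps directly onto `random.choices` from the Python standard library. The sketch below is illustrative; `select_weighted_random` is a hypothetical name, and the optional `rng` parameter exists only so the example can be seeded.

```python
import random
from typing import Optional


def select_weighted_random(
    weights: dict[str, int],
    healthy: set[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one healthy member with probability proportional to its weight."""
    rng = rng or random.Random()
    candidates = [m for m in weights if m in healthy]
    if not candidates:
        return None
    candidate_weights = [weights[m] for m in candidates]
    # choices() returns a list of k items; we only need one.
    return rng.choices(candidates, weights=candidate_weights, k=1)[0]
```

Over many requests with weights 70 and 30, roughly 70% of selections land on search-primary, but any individual run of consecutive picks may repeat the same member.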

Priority

Selects the healthy member with the lowest priority number. This creates a primary/backup pattern where backup members only receive traffic when higher-priority members are unavailable.

yaml
mcp_servers:
  llm-failover:
    mode: group
    strategy: priority
    min_healthy: 1
    members:
      - id: local-llm
        mode: subprocess
        command: [python, -m, local_llm]
        priority: 1
      - id: cloud-llm
        mode: remote
        endpoint: https://llm-api.example.com/mcp
        priority: 50
      - id: fallback-llm
        mode: remote
        endpoint: https://fallback.example.com/mcp
        priority: 99

All requests go to local-llm (priority 1) while it is healthy. If local-llm becomes unhealthy, requests route to cloud-llm (priority 50). If both are down, fallback-llm (priority 99) handles traffic. When local-llm recovers and passes health checks, it resumes as the primary.

Choose priority when you have a preferred MCP server and want others to serve only as backups.
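
The failover order in the example above reduces to "lowest priority number among healthy members". A minimal sketch, with `select_by_priority` as a hypothetical name:

```python
from typing import Optional


def select_by_priority(priorities: dict[str, int], healthy: set[str]) -> Optional[str]:
    """Return the healthy member with the lowest priority number."""
    candidates = [m for m in priorities if m in healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda m: priorities[m])
```

With priorities 1, 50, and 99, the function returns local-llm while it is healthy, cloud-llm when local-llm is out of rotation, and fallback-llm only when both are unavailable.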

Health Policy

The group tracks each member's health independently based on consecutive successes and failures.

| Parameter | Default | Description |
| --- | --- | --- |
| health.unhealthy_threshold | 2 | Consecutive failures before a member is removed from rotation |
| health.healthy_threshold | 1 | Consecutive successes before a member is re-added to rotation |

yaml
mcp_servers:
  resilient-pool:
    mode: group
    strategy: round_robin
    min_healthy: 2
    health:
      unhealthy_threshold: 3
      healthy_threshold: 2
    members:
      - id: worker-1
        mode: subprocess
        command: [python, -m, worker]
      - id: worker-2
        mode: subprocess
        command: [python, -m, worker]
      - id: worker-3
        mode: subprocess
        command: [python, -m, worker]

Removal and Re-entry Flow

  1. A member starts in rotation (healthy)
  2. Each failed health check or invocation error increments consecutive_failures
  3. When consecutive_failures >= unhealthy_threshold, the member is removed from rotation
  4. While removed, the member continues to receive health checks
  5. Each successful health check increments consecutive_successes and resets consecutive_failures
  6. When consecutive_successes >= healthy_threshold AND the MCP server state is READY, the member re-enters rotation

!!! note
    A member must reach the READY MCP server state to re-enter rotation. Health check successes alone are not sufficient -- the underlying MCP server process must be fully initialized.
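
The removal and re-entry steps can be modeled as a small state tracker. This is a sketch under stated assumptions, not the project's code: `MemberHealth` is a hypothetical class, readiness of the underlying MCP server is passed in as a boolean, and resetting `consecutive_successes` on failure is an assumption (the steps above only state that a success resets the failure counter).

```python
from dataclasses import dataclass


@dataclass
class MemberHealth:
    """Track consecutive successes and failures for one group member."""

    unhealthy_threshold: int = 2
    healthy_threshold: int = 1
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    in_rotation: bool = True  # step 1: members start in rotation

    def record_failure(self) -> None:
        # Step 2: failed health checks and invocation errors both count.
        self.consecutive_failures += 1
        self.consecutive_successes = 0  # assumption: a failure breaks the streak
        # Step 3: remove the member once the threshold is reached.
        if self.consecutive_failures >= self.unhealthy_threshold:
            self.in_rotation = False

    def record_success(self, server_ready: bool) -> None:
        # Step 5: a success extends the streak and clears failures.
        self.consecutive_successes += 1
        self.consecutive_failures = 0
        # Step 6: re-entry needs both the streak and the READY state.
        if (
            not self.in_rotation
            and self.consecutive_successes >= self.healthy_threshold
            and server_ready
        ):
            self.in_rotation = True
```

With `unhealthy_threshold=3` and `healthy_threshold=2`, a member leaves rotation on its third consecutive failure and returns only after two consecutive successes while the server reports READY.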

The hangar_group_rebalance tool can be used to manually trigger a health re-evaluation of all members, re-adding recovered members and removing failed ones.

Circuit Breaker

The group-level circuit breaker protects against cascading failures by rejecting all requests once the total failure count reaches a threshold.

| Parameter | Default | Description |
| --- | --- | --- |
| circuit_breaker.failure_threshold | 10 | Total group failures before the circuit opens |
| circuit_breaker.reset_timeout_s | 60.0 | Seconds before the circuit auto-resets |

yaml
mcp_servers:
  protected-pool:
    mode: group
    strategy: weighted_round_robin
    min_healthy: 1
    circuit_breaker:
      failure_threshold: 5
      reset_timeout_s: 30.0
    members:
      - id: svc-1
        mode: remote
        endpoint: https://svc1.example.com/mcp
        weight: 60
      - id: svc-2
        mode: remote
        endpoint: https://svc2.example.com/mcp
        weight: 40

Circuit Breaker States

CLOSED (normal operation)
   |
   | total failures >= failure_threshold
   v
OPEN (all requests rejected)
   |
   | reset_timeout_s elapses
   v
CLOSED (normal operation resumes)

  • CLOSED -- Normal operation. Requests are routed to healthy members. Each failure increments the failure counter.
  • OPEN -- All requests are rejected immediately (the group enters the degraded state). No member selection occurs.
  • Auto-reset -- After reset_timeout_s elapses, the next request attempt closes the circuit and resets the failure counter.

!!! warning
    The circuit breaker tracks total group failures, not per-member failures. A burst of errors from a single member can trip the breaker even if other members are healthy.

The hangar_group_rebalance tool resets the circuit breaker immediately, regardless of the timeout.
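
The two-state breaker described in this section can be sketched as follows. `GroupCircuitBreaker` and its methods are hypothetical names chosen for illustration, and the injectable `clock` parameter is an assumption added so the timeout can be exercised without waiting. The `reset` method stands in for the manual reset that hangar_group_rebalance performs.

```python
import time
from typing import Callable, Optional


class GroupCircuitBreaker:
    """CLOSED/OPEN breaker that counts total group failures."""

    def __init__(
        self,
        failure_threshold: int = 10,
        reset_timeout_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means CLOSED

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def record_failure(self) -> None:
        # Failures from any member count toward the same total.
        self._failures += 1
        if self._opened_at is None and self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Auto-reset: the first request after the timeout closes the circuit.
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            self.reset()
            return True
        return False

    def reset(self) -> None:
        """Close the circuit and clear the failure counter immediately."""
        self._failures = 0
        self._opened_at = None
```

Because the counter is shared across the group, five quick failures from one member with `failure_threshold: 5` open the circuit for every member until the timeout elapses or the breaker is reset.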

Tool Access Filtering

Tool access filtering controls which tools are visible when invoking a group or its members. Filters use a three-level policy hierarchy with fnmatch glob pattern matching (*, ?, [seq]).

Policy Hierarchy

  1. MCP Server-level -- Applied to the MCP server's own tool list
  2. Group-level -- Applied to the group as a whole
  3. Member-level -- Applied per member within the group

Configuration

yaml
mcp_servers:
  secure-pool:
    mode: group
    strategy: round_robin
    tools:
      allow_list: ["query_*", "search_*"]
      deny_list: []
    members:
      - id: full-access
        mode: subprocess
        command: [python, -m, data_server]
        tools:
          allow_list: []
          deny_list: ["admin_*"]
      - id: read-only
        mode: subprocess
        command: [python, -m, data_server]
        tools:
          allow_list: ["query_*"]
          deny_list: []

Individual MCP servers can also define tool access policies:

yaml
mcp_servers:
  restricted-mcp-server:
    mode: subprocess
    command: [python, -m, server]
    tools:
      allow_list: ["safe_*"]
      deny_list: []

Resolution Rules

| Condition | Behavior |
| --- | --- |
| allow_list is set (non-empty) | Only tools matching an allow pattern are visible |
| allow_list is empty, deny_list is set | All tools visible except those matching a deny pattern |
| Both empty | All tools visible |
| Both set | allow_list takes precedence; deny_list is ignored |

Patterns use Python's fnmatch module:

  • * matches any sequence of characters, including an empty one
  • ? matches any single character
  • [seq] matches any character in seq
  • [!seq] matches any character not in seq
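
The resolution rules for a single policy can be expressed with the standard library's fnmatch module. This sketch covers one allow_list/deny_list pair only; how the three policy levels combine is not modeled. `is_tool_visible` is a hypothetical name, and using `fnmatchcase` (always case-sensitive) instead of `fnmatch` (case handling depends on the operating system) is an assumption made to keep the example deterministic.

```python
from fnmatch import fnmatchcase


def is_tool_visible(tool_name: str, allow_list: list[str], deny_list: list[str]) -> bool:
    """Apply the allow/deny resolution rules to one tool name."""
    # A non-empty allow_list takes precedence; deny_list is ignored.
    if allow_list:
        return any(fnmatchcase(tool_name, pattern) for pattern in allow_list)
    # With no allow_list, hide only tools matching a deny pattern.
    if deny_list:
        return not any(fnmatchcase(tool_name, pattern) for pattern in deny_list)
    # Both lists empty: everything is visible.
    return True
```

Applied to the secure-pool example, the group-level allow_list `["query_*", "search_*"]` exposes `query_users` and `search_docs` but hides `admin_reset`.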

MCP Hangar · Released under MIT License.