Self-Healing

MeshStor automatically recovers from network failures, node outages, and drive loss. No administrator intervention is required. With replicaCount>=2, volumes continue serving reads and writes from the remaining replicas while recovery happens in the background.

How It Works

A reconciliation loop runs on every node every 10 seconds. Each cycle:

  1. Detects faulty or missing MD RAID members
  2. Attempts to reconnect lost NVMe-oF connections
  3. Re-adds recovered partitions to the MD array
  4. Requests replacement partitions for members that have been missing too long
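The four steps above can be sketched as a single function that is called once per cycle. This is a simplified model, not MeshStor's implementation: the `Member` class, its fields, and the `try_reconnect` callback are hypothetical names chosen for illustration, and the real reconciler acts on MD RAID and NVMe-oF rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import Optional

MEMBER_MISSING_TIMEOUT_S = 900  # default memberMissingTimeout (15 min)


@dataclass
class Member:
    node: str
    state: str = "Synced"            # Synced | Missing | Syncing | Deleting
    connected: bool = True           # hypothetical flag: NVMe-oF link is up
    missing_since: Optional[float] = None


def reconcile(members, now, try_reconnect):
    """Run one cycle and return the actions taken, in order."""
    actions = []
    for m in members:
        # 1. Detect faulty or missing members.
        if m.state == "Synced" and not m.connected:
            m.state, m.missing_since = "Missing", now
            actions.append(("mark_missing", m.node))
        if m.state != "Missing":
            continue
        # 2. Attempt to reconnect the lost connection.
        if not m.connected:
            m.connected = try_reconnect(m.node)
        # 3. Re-add a recovered partition to the array.
        if m.connected:
            m.state, m.missing_since = "Syncing", None
            actions.append(("re_add", m.node))
        # 4. Request a replacement once the member has been missing too long.
        elif now - m.missing_since >= MEMBER_MISSING_TIMEOUT_S:
            m.state = "Deleting"
            actions.append(("request_replacement", m.node))
    return actions
```

Calling `reconcile` with `now` advancing in 10-second steps mimics the loop: a member that reconnects before the timeout yields a `re_add` action, and one that never reconnects eventually yields `request_replacement`.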

Network Failure Timeline

When a remote node becomes unreachable:

| Time | What Happens |
| --- | --- |
| 0s | Network connection drops |
| 1s | NVMe keep-alive probe fails, I/O errors begin (fast_io_fail_tmo=1s). Fast detection lets MD RAID switch to serving I/O from the remaining replicas instead of stalling on an unresponsive device. |
| 3s | NVMe controller removed from kernel (ctrl_loss_tmo=3s), block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller. |
| ~10s | Next Probe cycle. MD detects the missing member and removes it from the array. With replicaCount>=2, the volume continues in degraded mode: reads and writes use the remaining replicas. With replicaCount=1, MD stops because no replicas remain, and the pod is evicted so Kubernetes can reschedule it. |
| every ~10s | Reconciler attempts to reconnect NVMe-oF to the lost node. |
| on recovery | NVMe reconnects, partition re-added to MD. MD uses the write-intent bitmap for fast incremental resync (only blocks written during the outage). |
| 15 min (default) | If still missing: partition marked for deletion, replacement requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own, because a replacement triggers a full resync, which is more expensive than waiting. Configurable via memberMissingTimeout. |
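As a rough aid for reading the timeline, the thresholds can be expressed as a lookup from elapsed time to a coarse phase. The phase labels are hypothetical, and the sketch assumes two simplifications: the Probe cycle lands exactly 10 seconds after the drop (the worst case), and memberMissingTimeout is measured from the moment the connection drops.

```python
FAST_IO_FAIL_TMO_S = 1   # fast_io_fail_tmo from the timeline
CTRL_LOSS_TMO_S = 3      # ctrl_loss_tmo from the timeline
PROBE_INTERVAL_S = 10    # reconciliation interval


def outage_phase(elapsed_s, member_missing_timeout_s=900):
    """Map seconds since the connection dropped to a coarse phase label."""
    if elapsed_s >= member_missing_timeout_s:
        return "replacement_requested"
    if elapsed_s >= PROBE_INTERVAL_S:
        return "degraded_reconnecting"
    if elapsed_s >= CTRL_LOSS_TMO_S:
        return "controller_removed"
    if elapsed_s >= FAST_IO_FAIL_TMO_S:
        return "io_failing"
    return "connection_dropped"
```

For example, `outage_phase(130, member_missing_timeout_s=120)` falls in the replacement phase, while the same outage under the default timeout is still in the reconnect phase.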

Note

MeshStor always places one replica on the consumer node itself. Because the local data path doesn't traverse the network, a remote node failure or network partition doesn't interrupt I/O on the consumer. With replicaCount>=2, the volume stays readable and writable throughout the timeline above — applications see no interruption as long as at least one replica remains active, and the local replica is normally that one.

Warning

With replicaCount=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network-related unavailability applies only when the single partition lives on a remote node because the current node lacks enough local storage to migrate it locally. When the partition is local, network failures do not affect the volume.

Recovery vs. Replacement

MeshStor distinguishes between two scenarios:

Automatic Recovery (node comes back before timeout)

sequenceDiagram
    participant R as Reconciler
    participant MD as MD RAID
    participant N as NVMe-oF

    R->>N: Attempt reconnect (every Probe)
    N-->>R: Connection restored
    R->>MD: mdadm --re-add (bitmap resync)
    MD-->>R: Syncing... Synced
  • With replicaCount>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync — only blocks written during the outage are rebuilt. Resync is fast (seconds to minutes depending on write volume during the outage).

Automatic Replacement (node stays down past timeout)

sequenceDiagram
    participant R as Reconciler
    participant K as Kubernetes API
    participant New as New Node

    R->>K: Transition Missing → Deleting
    R->>K: Select best available node, request new partition
    New->>New: Create partition, export via NVMe-oF
    R->>R: Import new partition, mdadm --add
    R->>R: Full resync from remaining members
    R->>K: Transition Syncing → Synced
  • A new node is selected based on RDMA connectivity, free space, and latency
  • Nodes that already hold a partition for this volume are excluded — this maintains fault isolation so a single node failure cannot take out multiple replicas
  • Full resync is required (no bitmap — this is a brand-new partition with no prior data)
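The selection rules above can be illustrated with a small ranking function. The source names the criteria (RDMA connectivity, free space, latency) but not how they are weighted, so the ordering used here is an assumption: RDMA-capable nodes first, then lowest latency, then most free space. The `Candidate` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    rdma: bool          # node is reachable over RDMA
    free_bytes: int     # free space available for a new partition
    latency_us: float   # measured latency to the consumer node


def pick_replacement_node(candidates, holders, needed_bytes):
    """Return the name of the best eligible node, or None if there is none.

    holders: names of nodes that already hold a partition for this volume.
    """
    eligible = [
        c for c in candidates
        # Excluding current holders keeps each replica on a separate node.
        if c.name not in holders and c.free_bytes >= needed_bytes
    ]
    if not eligible:
        return None
    # Assumed ranking: RDMA first, then low latency, then more free space.
    best = min(eligible, key=lambda c: (not c.rdma, c.latency_us, -c.free_bytes))
    return best.name
```

Excluding `holders` before ranking is what preserves fault isolation: a node with ideal connectivity is still skipped if it already stores a replica of the volume.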

Partition State Transitions

Synced ─→ Missing/Faulty ─→ (recovered)  ─→ Syncing ─→ Synced
                          ─→ (timed out)  ─→ Deleting ─→ removed
Synced ─→ Replacing ─→ Deleting ─→ removed              (graceful drain swap)
(new)   ─→ Requested ─→ Created ─→ Spare ─→ Syncing ─→ Synced
| State | Meaning |
| --- | --- |
| Synced | Partition is healthy and in sync |
| Missing | Partition was reachable but is not anymore. Recovery is attempted each Probe cycle. |
| Faulty | MD has flagged the member faulty (write error or controller loss). Treated like Missing for replacement; resync after re-add is incremental as long as the bitmap is intact. |
| Syncing | Partition is being rebuilt (recovery or replacement) |
| Spare | Partition has been attached to the MD array but is not yet syncing. This state is brief: the reconciler promotes a spare into a replacement as soon as one appears. |
| Replacing | An mdadm --replace swap is in progress: a spare member is taking over from this one |
| Deleting | Partition has timed out or been replaced and is being cleaned up |
| Requested | Replacement partition requested on a new node, not yet created |
| Created | Replacement partition exists on the new node, waiting to be imported |
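The transition diagram can be encoded as a lookup table, which is handy for checking whether an observed sequence of states is one the diagram allows. This is a sketch derived only from the arrows shown on this page; the `Removed` terminal state and the function names are illustrative, and MeshStor may permit transitions that the diagram omits.

```python
# Allowed next states, transcribed from the transition diagram.
TRANSITIONS = {
    "Synced": {"Missing", "Faulty", "Replacing"},
    "Missing": {"Syncing", "Deleting"},    # recovered, or timed out
    "Faulty": {"Syncing", "Deleting"},     # treated like Missing
    "Syncing": {"Synced"},
    "Replacing": {"Deleting"},             # graceful drain swap
    "Deleting": {"Removed"},               # "Removed" is a label used only here
    "Requested": {"Created"},
    "Created": {"Spare"},
    "Spare": {"Syncing"},
}


def can_transition(src, dst):
    """True if the diagram shows an arrow from src to dst."""
    return dst in TRANSITIONS.get(src, set())


def is_valid_path(states):
    """True if every consecutive pair of states is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))
```

For instance, `is_valid_path(["Requested", "Created", "Spare", "Syncing", "Synced"])` holds for a new replacement partition, whereas a jump from `Synced` straight to `Created` does not.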

Observing Self-Healing

Watch volume state in real time:

kubectl get msvol -w
NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h
pvc-cd1038a7-...   Synced    active       1/2     1                  mf-01-02   1h
pvc-cd1038a7-...   Syncing   recovering   1/2     1          23.4%   mf-01-02   1h
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h

Inspect individual partition states:

kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'
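The jsonpath query above prints one tab-separated line per partition (node ID, state, update time). A minimal sketch for filtering that output down to partitions that are not Synced, assuming exactly three fields per line; the function name is hypothetical and the timestamp is treated as an opaque string.

```python
def unhealthy_partitions(jsonpath_output):
    """Return (node_id, state, updated_at) for every partition not in Synced."""
    rows = []
    for line in jsonpath_output.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        node_id, state, updated_at = line.split("\t", 2)
        if state != "Synced":
            rows.append((node_id, state, updated_at))
    return rows
```

The text could come from running the kubectl command through `subprocess.run(..., capture_output=True, text=True)` and passing `stdout` to the function.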

Tuning

memberMissingTimeout

Controls how long a partition stays in Missing before replacement begins.

| Value | Trade-off |
| --- | --- |
| 900s (default, 15 min) | Tolerates long outages without unnecessary rebuilds |
| 120s (2 min) | Faster recovery, but brief network blips trigger full rebuilds |
| 60s (minimum) | Aggressive: use only when fast replacement is critical and network is stable |

Set per volume:

kubectl patch msvol my-volume --type=merge \
  -p '{"spec":{"memberMissingTimeout":120}}'

Set for all new volumes via StorageClass (see the memberMissingTimeout parameter reference):

parameters:
  memberMissingTimeout: "120"
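StorageClass parameters are strings, so tooling that generates them may want to check the value before applying it. The sketch below parses the string and enforces the 60-second minimum listed on this page. Whether MeshStor itself rejects or clamps smaller values is not stated here, so raising an error is an assumption, and the function name is hypothetical.

```python
MIN_MEMBER_MISSING_TIMEOUT_S = 60  # minimum value documented on this page


def parse_member_missing_timeout(raw):
    """Parse a memberMissingTimeout parameter string into whole seconds."""
    seconds = int(raw)  # raises ValueError for non-numeric input
    if seconds < MIN_MEMBER_MISSING_TIMEOUT_S:
        raise ValueError(
            f"memberMissingTimeout must be >= {MIN_MEMBER_MISSING_TIMEOUT_S}s, got {seconds}s"
        )
    return seconds
```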

Warning

Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default 15 minutes is a safe choice for most environments.
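A back-of-the-envelope comparison shows why waiting is usually cheaper. The figures used below (a 100 GiB partition, 2 GiB written during the outage, 1 GiB/s resync throughput) are illustrative assumptions, not measurements, and a bitmap resync copies whole bitmap chunks, so the dirty-byte figure is a lower bound.

```python
GIB = 1024 ** 3


def resync_seconds(bytes_to_copy, throughput_bytes_per_s):
    """Idealized resync duration: data to copy divided by sustained throughput."""
    return bytes_to_copy / throughput_bytes_per_s


# Recovery: only blocks written during the outage are rebuilt.
bitmap_resync = resync_seconds(2 * GIB, 1 * GIB)
# Replacement: the whole partition is rebuilt on a new node.
full_resync = resync_seconds(100 * GIB, 1 * GIB)
```

Under these assumptions, recovery copies 2 GiB while replacement copies the full 100 GiB, a 50x difference in data moved at the same throughput.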

replicaCount

Higher replica counts provide more resilience:

| replicaCount | Survives | Recovery behavior |
| --- | --- | --- |
| 1 | Relocations only (no redundancy) | Volume unavailable during outage. Recovers when node with storage returns. |
| 2 | 1 node failure | Reads and writes continue from the remaining replica, then recovery or replacement |
| 3 | 2 simultaneous node failures | Reads and writes continue from the remaining replicas, then recovery or replacement |
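The table follows a simple rule: a volume keeps serving I/O as long as at least one replica remains, so it tolerates replicaCount - 1 simultaneous failures of nodes holding its replicas. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def tolerated_node_failures(replica_count):
    """Simultaneous replica-holding node failures a volume can survive."""
    if replica_count < 1:
        raise ValueError("replicaCount must be at least 1")
    return replica_count - 1


def stays_available(replica_count, failed_replica_nodes):
    """True while at least one replica is still active."""
    return failed_replica_nodes <= tolerated_node_failures(replica_count)
```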

What's Next