
Self-Healing

MeshStor automatically recovers from network failures, node outages, and drive loss without administrator intervention. With numberOfCopies>=2, volumes continue serving reads and writes from the remaining copies while recovery happens in the background.

How It Works

A reconciliation loop, called the Probe, runs on every node every 10 seconds. Each cycle:

  1. Detects faulty or missing MD RAID members
  2. Attempts to reconnect lost NVMe-oF connections
  3. Re-adds recovered partitions to the MD array
  4. Requests replacement partitions for members that have been missing too long
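The cycle above can be sketched in Python. This is an illustrative model, not MeshStor's actual API: the names `Member`, `reconcile`, and the callback parameters are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

MEMBER_MISSING_TIMEOUT = 900  # seconds (the memberMissingTimeout default)

@dataclass
class Member:
    node: str
    connected: bool = True
    state: str = "Synced"              # Synced | Missing | Syncing | Deleting
    missing_since: Optional[float] = None

def reconcile(members, now, reconnect, readd, request_replacement):
    """One Probe cycle: detect, reconnect, re-add, or replace members."""
    for m in members:
        # 1. Detect faulty or missing members
        if not m.connected and m.state == "Synced":
            m.state, m.missing_since = "Missing", now
        if m.state != "Missing":
            continue
        # 2./3. Try to reconnect NVMe-oF and re-add the partition to MD
        if reconnect(m):
            readd(m)                   # conceptually: mdadm --re-add (bitmap resync)
            m.state, m.missing_since = "Syncing", None
        # 4. Request a replacement for members missing too long
        elif now - m.missing_since >= MEMBER_MISSING_TIMEOUT:
            m.state = "Deleting"
            request_replacement(m)
```

Running one cycle per tick against this model reproduces the timeline described in the next section: a member moves to Missing on detection, back through Syncing on reconnect, or to Deleting once the timeout expires.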

Network Failure Timeline

When a remote node becomes unreachable:

| Time | What happens |
| --- | --- |
| 0s | Network connection drops |
| 1s | NVMe keep-alive probe fails and I/O errors begin (fast_io_fail_tmo=1s). Fast detection lets MD RAID switch to serving I/O from the remaining copies instead of stalling on an unresponsive device. |
| 3s | NVMe controller removed from the kernel (ctrl_loss_tmo=3s); the block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller. |
| ~10s | Next Probe cycle. MD detects the missing member and removes it from the array. With numberOfCopies>=2, the volume continues in degraded mode: reads and writes use the remaining copies. With numberOfCopies=1, MD stops because no copies remain, and the pod is evicted so Kubernetes can reschedule it to a node where the data becomes available again. |
| every ~10s | The reconciler attempts to reconnect NVMe-oF to the lost node. |
| on recovery | NVMe reconnects and the partition is re-added to MD, which uses the write-intent bitmap for a fast incremental resync (only blocks written during the outage). |
| 15 min (default) | If still missing: the partition is marked for deletion and a replacement is requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own; a replacement triggers a full resync, which is more expensive than waiting. Configurable via memberMissingTimeout. |
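Composing these timeouts gives a back-of-the-envelope worst case. This is my own arithmetic under the stated defaults, assuming the failure lands just after a Probe tick; exact timing depends on where in the interval the failure occurs.

```python
FAST_IO_FAIL_TMO = 1          # s: I/O errors begin
CTRL_LOSS_TMO = 3             # s: controller removed, block device disappears
PROBE_INTERVAL = 10           # s: reconciler (Probe) period
MEMBER_MISSING_TIMEOUT = 900  # s: default memberMissingTimeout

# Worst case: the device disappears just after a Probe tick, so MD
# detection waits almost a full interval on top of ctrl_loss_tmo.
worst_case_md_detach = CTRL_LOSS_TMO + PROBE_INTERVAL

# Replacement starts at the first Probe tick at or after the timeout expires.
worst_case_replacement_start = (
    worst_case_md_detach + MEMBER_MISSING_TIMEOUT + PROBE_INTERVAL
)

print(worst_case_md_detach, worst_case_replacement_start)  # 13 923
```

In other words, a member is detached within roughly 13 seconds of a failure, and a replacement begins at most about 923 seconds after it, under default settings.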

Note

With numberOfCopies>=2, the volume stays readable and writable throughout this process; applications see no interruption as long as at least one copy remains active.

Warning

With numberOfCopies=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network-related unavailability only applies when the current node lacks enough local storage to migrate the partition from the remote node; when the partition is local, network failures do not affect the volume.

Recovery vs. Replacement

MeshStor distinguishes between two scenarios:

Automatic Recovery (node comes back before timeout)

sequenceDiagram
    participant R as Reconciler
    participant MD as MD RAID
    participant N as NVMe-oF

    R->>N: Attempt reconnect (every Probe)
    N-->>R: Connection restored
    R->>MD: mdadm --re-add (bitmap resync)
    MD-->>R: Syncing... Synced

  • With numberOfCopies>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync: only blocks written during the outage are rebuilt. Resync is fast, taking seconds to minutes depending on write volume during the outage.
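The write-intent bitmap mechanism can be illustrated with a toy model. The chunk size and function names here are illustrative assumptions; real MD bitmap chunk sizes are configurable and tracked per array.

```python
CHUNK_MB = 64  # illustrative bitmap chunk size (assumption, not MeshStor's value)

def dirty_chunks(write_offsets_mb, chunk_mb=CHUNK_MB):
    """Map write offsets (in MB) to the set of bitmap chunks they dirty."""
    return {offset // chunk_mb for offset in write_offsets_mb}

def resync_cost_mb(write_offsets_mb, volume_mb, chunk_mb=CHUNK_MB):
    """Incremental resync copies only dirty chunks; a brand-new member
    with no bitmap history would need a full resync of volume_mb."""
    return min(len(dirty_chunks(write_offsets_mb, chunk_mb)) * chunk_mb, volume_mb)

# Writes clustered in three chunks during the outage:
writes = [10, 20, 70, 130, 135]                # MB offsets written while degraded
print(resync_cost_mb(writes, volume_mb=1024))  # 192, vs. a 1024 MB full resync
```

This is why recovery is so much cheaper than replacement: resync cost scales with how much was written during the outage, not with volume size.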

Automatic Replacement (node stays down past timeout)

sequenceDiagram
    participant R as Reconciler
    participant K as Kubernetes API
    participant New as New Node

    R->>K: Transition Missing → Deleting
    R->>K: Select best available node, request new partition
    New->>New: Create partition, export via NVMe-oF
    R->>R: Import new partition, mdadm --add
    R->>R: Full resync from remaining members
    R->>K: Transition Syncing → Synced

  • A new node is selected based on RDMA connectivity, free space, and latency
  • Nodes that already hold a partition for this volume are excluded; this maintains fault isolation so a single node failure cannot take out multiple copies
  • Full resync is required (no bitmap; this is a brand-new partition with no prior data)
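The selection policy reads as a filter-then-rank step. This sketch uses illustrative field names and an assumed ranking order (latency first, then free space); the actual scoring is internal to MeshStor.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    rdma_reachable: bool
    free_gb: int
    latency_us: float

def select_replacement(nodes, holders, needed_gb) -> Optional[Node]:
    """Pick a replacement node: RDMA-reachable, enough free space, and
    not already holding a copy of this volume (fault isolation).
    Among candidates, prefer low latency, then more free space."""
    candidates = [
        n for n in nodes
        if n.rdma_reachable and n.free_gb >= needed_gb and n.name not in holders
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.latency_us, -n.free_gb))
```

If no candidate passes the filters, no replacement is placed and the volume stays degraded until capacity or connectivity returns.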

Partition State Transitions

Synced → Missing → (recovered) → Syncing → Synced
                 → (timed out) → Deleting → removed
                                 (replacement) → Requested → Created → Syncing → Synced
| State | Meaning |
| --- | --- |
| Synced | Partition is healthy and in sync |
| Missing | Partition was reachable but is not anymore. Recovery is attempted each Probe cycle. |
| Syncing | Partition is being rebuilt (recovery or replacement) |
| Deleting | Partition timed out and is being cleaned up |
| Requested | Replacement partition requested on a new node, not yet created |
| Created | Replacement partition exists on the new node, waiting to be imported |
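The diagram above can be encoded as an allowed-transition table. This is a reading of the documented lifecycle, not MeshStor's internal representation.

```python
# Allowed state transitions, per the lifecycle documented above.
ALLOWED = {
    "Synced":    {"Missing"},
    "Missing":   {"Syncing", "Deleting"},  # recovered vs. timed out
    "Syncing":   {"Synced"},
    "Deleting":  set(),                    # partition is removed afterwards
    "Requested": {"Created"},              # replacement lifecycle
    "Created":   {"Syncing"},
}

def can_transition(src: str, dst: str) -> bool:
    return dst in ALLOWED.get(src, set())
```

Such a table makes illegal jumps (for example Synced straight to Deleting) easy to reject in a reconciler.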

Observing Self-Healing

Watch volume state in real time:

kubectl get msvol -w
NAME        PHASE     MDSTATE      TOTAL   ACTIVE   FAILED   DOWN   SYNC
my-volume   Synced    active       2       2        0        0
my-volume   Synced    degraded     2       1        0        1
my-volume   Syncing   recovering   2       1        0        1      23.4%
my-volume   Synced    active       2       2        0        0

Inspect individual partition states:

kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'

Tuning

memberMissingTimeout

Controls how long a partition stays in Missing before replacement begins.

| Value | Trade-off |
| --- | --- |
| 900s (default, 15 min) | Tolerates long outages without unnecessary rebuilds |
| 120s (2 min) | Faster recovery, but brief network blips trigger full rebuilds |
| 60s (minimum) | Aggressive; use only when fast replacement is critical and the network is stable |

Set per volume:

kubectl patch msvol my-volume --type=merge \
  -p '{"spec":{"memberMissingTimeout":120}}'

Set for all new volumes via StorageClass:

parameters:
  memberMissingTimeout: "120"

Warning

Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default of 15 minutes is a safe choice for most environments.

numberOfCopies

Higher copy counts provide more resilience:

| numberOfCopies | Survives | Recovery behavior |
| --- | --- | --- |
| 1 | Relocations only (no redundancy) | Volume unavailable during outage. Recovers when the node with the storage returns. |
| 2 | 1 node failure | Reads and writes continue from the remaining copy, then recovery or replacement |
| 3 | 2 simultaneous node failures | Reads and writes continue from the remaining copies, then recovery or replacement |
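The rule underlying this table is simple: since copies are placed on distinct nodes, a volume with n copies tolerates n−1 simultaneous node failures.

```python
def survivable_failures(number_of_copies: int) -> int:
    """Copies live on distinct nodes, so the volume stays available
    while at least one copy remains: n copies survive n-1 failures."""
    return max(number_of_copies - 1, 0)

for copies in (1, 2, 3):
    print(copies, "copies ->", survivable_failures(copies), "node failure(s)")
```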

What's Next