Self-Healing¶

MeshStor automatically recovers from network failures, node outages, and drive loss. No administrator intervention is required. With replicaCount>=2, volumes continue serving reads and writes from the remaining replicas while recovery happens in the background.

How It Works¶

A reconciliation loop runs on every node every 10 seconds. Each cycle:

Detects faulty or missing MD RAID members
Attempts to reconnect lost NVMe-oF connections
Re-adds recovered partitions to the MD array
Requests replacement partitions for members that have been missing too long
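
The four steps above can be sketched as a small simulation. This is an illustrative model only, not MeshStor's code: the `Member` class, the `reconcile` function, and the `reachable` set are hypothetical names, and the state names are borrowed from the Partition State Transitions section.

```python
from dataclasses import dataclass
from typing import Optional

MEMBER_MISSING_TIMEOUT_S = 900  # default memberMissingTimeout (15 min)


@dataclass
class Member:
    """Hypothetical stand-in for one MD RAID member of a volume."""
    node: str
    state: str = "Synced"
    missing_since: Optional[float] = None


def reconcile(members, reachable, now, timeout=MEMBER_MISSING_TIMEOUT_S):
    """Run one cycle; `reachable` is the set of nodes NVMe-oF can connect to.

    Returns the nodes whose partitions were given up on, so the caller
    can request replacements elsewhere.
    """
    replacements = []
    for m in members:
        if m.state == "Synced" and m.node not in reachable:
            # Step 1: detect a faulty or missing member.
            m.state, m.missing_since = "Missing", now
        elif m.state == "Missing" and m.node in reachable:
            # Steps 2-3: reconnect succeeded, re-add the partition.
            m.state, m.missing_since = "Syncing", None
        elif m.state == "Missing" and now - m.missing_since >= timeout:
            # Step 4: missing too long, request a replacement.
            m.state = "Deleting"
            replacements.append(m.node)
    return replacements


members = [Member("node-a"), Member("node-b")]
reconcile(members, reachable={"node-a"}, now=0)             # node-b goes Missing
reconcile(members, reachable={"node-a", "node-b"}, now=20)  # node-b comes back
print(members[1].state)  # Syncing
```

In the real system each cycle runs roughly every 10 seconds; here the caller passes `now` explicitly so the logic stays easy to test.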

Network Failure Timeline¶

When a remote node becomes unreachable:

Time	What Happens
0s	Network connection drops
1s	NVMe keep-alive probe fails, I/O errors begin (`fast_io_fail_tmo=1s`). Fast detection lets MD RAID switch to serving I/O from the remaining replicas instead of stalling on an unresponsive device.
3s	NVMe controller removed from kernel (`ctrl_loss_tmo=3s`), block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller.
~10s	Next Probe cycle. MD detects the missing member and removes it from the array. With `replicaCount>=2`, the volume continues in degraded mode — reads and writes use the remaining replicas. With `replicaCount=1`, MD stops because no replicas remain, and the pod is evicted so Kubernetes can reschedule it.
every ~10s	Reconciler attempts to reconnect NVMe-oF to the lost node.
on recovery	NVMe reconnects, partition re-added to MD. MD uses the write-intent bitmap for fast incremental resync (only blocks written during the outage).
15 min (default)	If still missing: partition marked for deletion, replacement requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own — a replacement triggers a full resync which is more expensive than waiting. Configurable via `memberMissingTimeout`.
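
As a reading aid, the timeline can be expressed as a lookup from elapsed time to phase. This is a rough sketch that assumes `replicaCount>=2`, treats the approximate times in the table as exact boundaries, and measures the timeout from the moment the connection drops. The `phase` function and its labels are hypothetical.

```python
FAST_IO_FAIL_TMO_S = 1   # fast_io_fail_tmo from the table
CTRL_LOSS_TMO_S = 3      # ctrl_loss_tmo from the table
PROBE_INTERVAL_S = 10    # reconciliation interval


def phase(elapsed_s, member_missing_timeout_s=900):
    """Return the timeline phase for an outage that has lasted `elapsed_s`."""
    if elapsed_s < FAST_IO_FAIL_TMO_S:
        return "connection dropped, not yet detected"
    if elapsed_s < CTRL_LOSS_TMO_S:
        return "I/O errors, MD uses remaining replicas"
    if elapsed_s < PROBE_INTERVAL_S:
        return "controller removed, awaiting next Probe"
    if elapsed_s < member_missing_timeout_s:
        return "degraded, reconnect attempted every Probe"
    return "replacement requested"


for t in (0.5, 2, 5, 60, 900):
    print(f"{t:>5}s  {phase(t)}")
```

If the node returns at any point before the final phase, the partition is re-added and resynced from the bitmap instead.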

Note

MeshStor always places one replica on the consumer node itself. Because the local data path doesn't traverse the network, a remote node failure or network partition doesn't interrupt I/O on the consumer. With replicaCount>=2, the volume stays readable and writable throughout the timeline above — applications see no interruption as long as at least one replica remains active, and the local replica is normally that one.

Warning

With replicaCount=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network-related unavailability applies only when the current node lacks enough local storage to migrate the partition from a remote node to local storage. When the partition is local, network failures do not affect the volume.

Recovery vs. Replacement¶

MeshStor distinguishes between two scenarios:

Automatic Recovery (node comes back before timeout)¶

sequenceDiagram
    participant R as Reconciler
    participant MD as MD RAID
    participant N as NVMe-oF

    R->>N: Attempt reconnect (every Probe)
    N-->>R: Connection restored
    R->>MD: mdadm --re-add (bitmap resync)
    MD-->>R: Syncing... Synced

With replicaCount>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync — only blocks written during the outage are rebuilt. Resync is fast (seconds to minutes depending on write volume during the outage).

Automatic Replacement (node stays down past timeout)¶

sequenceDiagram
    participant R as Reconciler
    participant K as Kubernetes API
    participant New as New Node

    R->>K: Transition Missing → Deleting
    R->>K: Select best available node, request new partition
    New->>New: Create partition, export via NVMe-oF
    R->>R: Import new partition, mdadm --add
    R->>R: Full resync from remaining members
    R->>K: Transition Syncing → Synced

A new node is selected based on RDMA connectivity, free space, and latency
Nodes that already hold a partition for this volume are excluded — this maintains fault isolation so a single node failure cannot take out multiple replicas
Full resync is required (no bitmap — this is a brand-new partition with no prior data)
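
The selection rules above might look like the following sketch. The exact ranking MeshStor applies is not specified here, so the ordering (RDMA first, then most free space, then lowest latency) is an assumption, as are the `Candidate` type, the `select_node` function, and the free-space check against `required_bytes`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a node considered for a replacement partition."""
    name: str
    rdma: bool
    free_bytes: int
    latency_us: float


def select_node(candidates, holders, required_bytes):
    """Pick a node for the new partition, or None if no node qualifies.

    `holders` is the set of nodes that already hold a partition of this
    volume; they are excluded to preserve fault isolation.
    """
    eligible = [
        c for c in candidates
        if c.name not in holders and c.free_bytes >= required_bytes
    ]
    if not eligible:
        return None
    # Assumed ranking: RDMA-capable first, then most free space,
    # then lowest latency.
    return min(eligible, key=lambda c: (not c.rdma, -c.free_bytes, c.latency_us))


nodes = [
    Candidate("n1", rdma=True, free_bytes=500, latency_us=50.0),
    Candidate("n2", rdma=True, free_bytes=800, latency_us=80.0),
    Candidate("n3", rdma=False, free_bytes=900, latency_us=20.0),
    Candidate("n4", rdma=True, free_bytes=900, latency_us=10.0),
]
# n4 already holds a replica of this volume, so it is skipped.
print(select_node(nodes, holders={"n4"}, required_bytes=100).name)  # n2
```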

Partition State Transitions¶

Synced ─→ Missing/Faulty ─→ (recovered)  ─→ Syncing ─→ Synced
                          ─→ (timed out)  ─→ Deleting ─→ removed
Synced ─→ Replacing ─→ Deleting ─→ removed              (graceful drain swap)
(new)   ─→ Requested ─→ Created ─→ Spare ─→ Syncing ─→ Synced

State	Meaning
`Synced`	Partition is healthy and in sync
`Missing`	Partition was previously reachable but no longer is. Recovery is attempted each Probe cycle.
`Faulty`	MD has flagged the member faulty (write error or controller loss). Treated like `Missing` for replacement; resync after re-add is incremental as long as the bitmap is intact.
`Syncing`	Partition is being rebuilt (recovery or replacement)
`Spare`	Partition has been attached to the MD array but is not yet syncing. This state is brief: the reconciler promotes a spare into a replacement as soon as one appears.
`Replacing`	An `mdadm --replace` swap is in progress: a spare member is taking over from this one
`Deleting`	Partition has timed out or been replaced and is being cleaned up
`Requested`	Replacement partition requested on a new node, not yet created
`Created`	Replacement partition exists on the new node, waiting to be imported
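
The transitions drawn in the diagram above can be encoded as a lookup table, which is handy when validating a sequence of observed states. This sketch includes only the edges shown in the diagram; the real reconciler may permit others. `TRANSITIONS`, `can_transition`, and `valid_path` are hypothetical names.

```python
# Allowed next states, taken from the diagram in this section.
TRANSITIONS = {
    "Requested": {"Created"},
    "Created": {"Spare"},
    "Spare": {"Syncing"},
    "Syncing": {"Synced"},
    "Synced": {"Missing", "Faulty", "Replacing"},
    "Missing": {"Syncing", "Deleting"},
    "Faulty": {"Syncing", "Deleting"},
    "Replacing": {"Deleting"},
    "Deleting": set(),  # the partition is removed after cleanup
}


def can_transition(src, dst):
    """True if the diagram shows an edge from `src` to `dst`."""
    return dst in TRANSITIONS.get(src, set())


def valid_path(states):
    """True if every consecutive pair in `states` is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


# Recovery: the node came back before the timeout.
print(valid_path(["Synced", "Missing", "Syncing", "Synced"]))  # True
# A partition being deleted never returns to the array.
print(valid_path(["Deleting", "Synced"]))  # False
```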

Observing Self-Healing¶

Watch volume state in real time:

kubectl get msvol -w

NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h
pvc-cd1038a7-...   Synced    active       1/2     1                  mf-01-02   1h
pvc-cd1038a7-...   Syncing   recovering   1/2     1          23.4%   mf-01-02   1h
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h

Inspect individual partition states:

kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'
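
For scripting, the same fields can be pulled out of the JSON form of the resource. The sketch below assumes the layout implied by the jsonpath expression above (`status.partitions[*]` with `nodeID`, `state`, and `updatedAt`). The sample document is made up for illustration, including the node name `mf-01-03` and the timestamp format.

```python
import json


def partition_rows(msvol):
    """Return (nodeID, state, updatedAt) for each partition of a volume."""
    parts = msvol.get("status", {}).get("partitions", [])
    return [(p.get("nodeID"), p.get("state"), p.get("updatedAt")) for p in parts]


def unhealthy(msvol):
    """Return only the partitions that are not in the Synced state."""
    return [row for row in partition_rows(msvol) if row[1] != "Synced"]


# Illustrative sample, shaped like `kubectl get msvol my-volume -o json`.
sample = json.loads("""
{
  "status": {
    "partitions": [
      {"nodeID": "mf-01-02", "state": "Synced",  "updatedAt": "2024-01-01T10:00:00Z"},
      {"nodeID": "mf-01-03", "state": "Missing", "updatedAt": "2024-01-01T10:05:00Z"}
    ]
  }
}
""")

for node, state, updated in unhealthy(sample):
    print(f"{node}\t{state}\t{updated}")
```

To use it against a live cluster, replace the sample with `json.load(sys.stdin)` and pipe in the output of `kubectl get msvol my-volume -o json`.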

Tuning¶

memberMissingTimeout¶

Controls how long a partition stays in Missing before replacement begins.

Value	Trade-off
900s (default, 15 min)	Tolerates long outages without unnecessary rebuilds
120s (2 min)	Faster recovery, but brief network blips trigger full rebuilds
60s (minimum)	Aggressive — use only when fast replacement is critical and network is stable

Set per volume:

kubectl patch msvol my-volume --type=merge \
  -p '{"spec":{"memberMissingTimeout":120}}'

Set for all new volumes via StorageClass (see the memberMissingTimeout parameter reference):

parameters:
  memberMissingTimeout: "120"

Warning

Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default 15 minutes is a safe choice for most environments.
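
A back-of-the-envelope calculation shows why waiting usually wins. The numbers below are assumptions chosen for illustration (a 1 TiB volume, 5 GiB written during the outage, 1 GiB/s sustained resync throughput), not measurements. Real bitmap resyncs copy whole bitmap chunks, so the incremental figure is a lower bound.

```python
GIB = 1024 ** 3


def resync_seconds(bytes_to_copy, throughput_bytes_per_s):
    """Estimate resync duration as data volume divided by throughput."""
    return bytes_to_copy / throughput_bytes_per_s


volume_bytes = 1024 * GIB  # assumed volume size: 1 TiB
dirty_bytes = 5 * GIB      # assumed data written during the outage
throughput = 1 * GIB       # assumed resync throughput: 1 GiB/s

full = resync_seconds(volume_bytes, throughput)        # replacement path
incremental = resync_seconds(dirty_bytes, throughput)  # recovery path

print(f"full resync:   {full:.0f} s")         # full resync:   1024 s
print(f"bitmap resync: {incremental:.0f} s")  # bitmap resync: 5 s
```

Under these assumptions a replacement copies about 200 times more data than a recovery, which is the cost a short `memberMissingTimeout` risks paying for an outage that would have ended on its own.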

replicaCount¶

Higher replica counts provide more resilience:

replicaCount	Survives	Recovery behavior
1	Relocations only (no redundancy)	Volume is unavailable during the outage and recovers when the node holding the storage returns.
2	1 node failure	Reads and writes continue from the remaining replica, then recovery or replacement
3	2 simultaneous node failures	Reads and writes continue from the remaining replicas, then recovery or replacement
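
The table follows a simple rule: a volume tolerates `replicaCount - 1` simultaneous losses. A minimal sketch, assuming every failed node holds exactly one replica of the volume (`tolerated_failures` and `stays_available` are hypothetical helper names):

```python
def tolerated_failures(replica_count):
    """Number of simultaneous replica losses a volume can survive."""
    if replica_count < 1:
        raise ValueError("replicaCount must be at least 1")
    return replica_count - 1


def stays_available(replica_count, failed_replicas):
    """True if at least one replica remains to serve reads and writes."""
    return failed_replicas <= tolerated_failures(replica_count)


for count in (1, 2, 3):
    print(count, tolerated_failures(count))
# 1 0
# 2 1
# 3 2
```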

What's Next¶

Replication — how MD RAID replication works
Volume Relocation — how volumes migrate during node drain
Volume Expansion — grow a volume online without downtime
Monitoring — proactive health tracking