Self-Healing¶
MeshStor automatically recovers from network failures, node outages, and drive loss. No administrator intervention is required. With replicaCount>=2, volumes continue serving reads and writes from the remaining replicas while recovery happens in the background.
How It Works¶
A reconciliation loop runs on every node every 10 seconds. Each cycle:
- Detects faulty or missing MD RAID members
- Attempts to reconnect lost NVMe-oF connections
- Re-adds recovered partitions to the MD array
- Requests replacement partitions for members that have been missing too long
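The cycle above can be sketched as a small state update over the array members. This is a minimal illustration, not MeshStor's implementation: the `Member` class, the `reconcile` function, and the action names are hypothetical. The 10-second interval and 900-second timeout mirror the defaults described on this page.

```python
from dataclasses import dataclass


@dataclass
class Member:
    node: str
    state: str = "Synced"     # Synced, Missing, Syncing, or Deleting
    reachable: bool = True    # stand-in for "NVMe-oF connection is up"
    missing_for_s: int = 0    # seconds spent in Missing so far


def reconcile(members, interval_s=10, member_missing_timeout_s=900):
    """Run one cycle over all members and return the actions taken."""
    actions = []
    for m in members:
        # Step 1: detect a member that has dropped out of the array.
        if m.state == "Synced" and not m.reachable:
            m.state = "Missing"
            m.missing_for_s = 0
            actions.append(("mark_missing", m.node))
            continue
        if m.state != "Missing":
            continue
        # Steps 2 and 3: if the connection is back, re-add the partition.
        if m.reachable:
            m.state = "Syncing"
            actions.append(("re_add", m.node))
            continue
        # Step 4: still missing; request a replacement after the timeout.
        m.missing_for_s += interval_s
        if m.missing_for_s >= member_missing_timeout_s:
            m.state = "Deleting"
            actions.append(("request_replacement", m.node))
    return actions
```

Calling `reconcile` repeatedly with an unreachable member first marks it Missing, then does nothing until either the member becomes reachable again (re-add) or the accumulated missing time reaches the timeout (replacement).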
Network Failure Timeline¶
When a remote node becomes unreachable:
| Time | What Happens |
|---|---|
| 0s | Network connection drops |
| 1s | NVMe keep-alive probe fails, I/O errors begin (fast_io_fail_tmo=1s). Fast detection lets MD RAID switch to serving I/O from the remaining replicas instead of stalling on an unresponsive device. |
| 3s | NVMe controller removed from kernel (ctrl_loss_tmo=3s), block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller. |
| ~10s | Next Probe cycle. MD detects the missing member and removes it from the array. With replicaCount>=2, the volume continues in degraded mode — reads and writes use the remaining replicas. With replicaCount=1, MD stops because no replicas remain, and the pod is evicted so Kubernetes can reschedule it. |
| every ~10s | Reconciler attempts to reconnect NVMe-oF to the lost node. |
| on recovery | NVMe reconnects, partition re-added to MD. MD uses the write-intent bitmap for fast incremental resync (only blocks written during the outage). |
| 15 min (default) | If still missing: partition marked for deletion, replacement requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own — a replacement triggers a full resync which is more expensive than waiting. Configurable via memberMissingTimeout. |
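As a rough aid for reading the timeline, the sketch below maps the time elapsed since the connection dropped to the phase a remote member would be in. The function name and phase labels are hypothetical. It assumes the timers are all measured from the moment of failure and that Probe cycles land exactly on 10-second boundaries, which the real system only approximates.

```python
FAST_IO_FAIL_TMO_S = 1   # fast_io_fail_tmo from the table above
CTRL_LOSS_TMO_S = 3      # ctrl_loss_tmo from the table above
PROBE_INTERVAL_S = 10    # reconciliation interval


def outage_phase(elapsed_s, member_missing_timeout_s=900):
    """Return a label for the phase reached `elapsed_s` seconds into an outage."""
    if elapsed_s < FAST_IO_FAIL_TMO_S:
        return "stalled"                  # connection dropped, not yet detected
    if elapsed_s < CTRL_LOSS_TMO_S:
        return "io_errors"                # I/O to the remote member fails fast
    if elapsed_s < PROBE_INTERVAL_S:
        return "controller_removed"       # block device is gone
    if elapsed_s < member_missing_timeout_s:
        return "degraded_reconnecting"    # member removed, reconnects retried
    return "replacement_requested"        # timeout expired


for t in (0, 2, 5, 60, 900):
    print(t, outage_phase(t))
```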
Note
MeshStor always places one replica on the consumer node itself. Because the local data path doesn't traverse the network, a remote node failure or network partition doesn't interrupt I/O on the consumer. With replicaCount>=2, the volume stays readable and writable throughout the timeline above — applications see no interruption as long as at least one replica remains active, and the local replica is normally that one.
Warning
With replicaCount=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network failures cause this only when the single partition is remote, that is, when the current node lacks enough local storage to migrate the partition from the remote node. When the partition is local, network failures do not affect the volume.
Recovery vs. Replacement¶
MeshStor distinguishes between two scenarios:
Automatic Recovery (node comes back before timeout)¶
sequenceDiagram
participant R as Reconciler
participant MD as MD RAID
participant N as NVMe-oF
R->>N: Attempt reconnect (every Probe)
N-->>R: Connection restored
R->>MD: mdadm --re-add (bitmap resync)
MD-->>R: Syncing... Synced
- With replicaCount>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync, so only blocks written during the outage are rebuilt. Resync is fast (seconds to minutes, depending on write volume during the outage).
Automatic Replacement (node stays down past timeout)¶
sequenceDiagram
participant R as Reconciler
participant K as Kubernetes API
participant New as New Node
R->>K: Transition Missing → Deleting
R->>K: Select best available node, request new partition
New->>New: Create partition, export via NVMe-oF
R->>R: Import new partition, mdadm --add
R->>R: Full resync from remaining members
R->>K: Transition Syncing → Synced
- A new node is selected based on RDMA connectivity, free space, and latency
- Nodes that already hold a partition for this volume are excluded — this maintains fault isolation so a single node failure cannot take out multiple replicas
- Full resync is required (no bitmap — this is a brand-new partition with no prior data)
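The selection rules above can be illustrated with a short sketch. The `Candidate` fields, the function name, and especially the ranking order (RDMA first, then free space, then latency) are assumptions made for illustration; the page names the criteria but does not say how MeshStor weighs them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    rdma: bool          # node is reachable over RDMA
    free_bytes: int     # unallocated capacity
    latency_us: float   # measured latency to the consumer node


def pick_replacement_node(candidates, nodes_with_partition, needed_bytes):
    """Return the preferred candidate, or None if no node qualifies."""
    eligible = [
        c for c in candidates
        # Fault isolation: skip nodes that already hold a partition.
        if c.name not in nodes_with_partition and c.free_bytes >= needed_bytes
    ]
    if not eligible:
        return None
    # Assumed ranking: RDMA-capable first, then most free space, then lowest latency.
    return min(eligible, key=lambda c: (not c.rdma, -c.free_bytes, c.latency_us))
```

Excluding nodes that already hold a partition happens before ranking, so a well-connected node is never chosen if it would put two replicas of the same volume on one machine.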
Partition State Transitions¶
Synced ─→ Missing/Faulty ─→ (recovered) ─→ Syncing ─→ Synced
                         ─→ (timed out) ─→ Deleting ─→ removed
Synced ─→ Replacing ─→ Deleting ─→ removed   (graceful drain swap)
(new)  ─→ Requested ─→ Created ─→ Spare ─→ Syncing ─→ Synced
| State | Meaning |
|---|---|
| Synced | Partition is healthy and in sync |
| Missing | Partition was reachable but is not anymore. Recovery is attempted each Probe cycle. |
| Faulty | MD has flagged the member faulty (write error or controller loss). Treated like Missing for replacement; resync after re-add is incremental as long as the bitmap is intact. |
| Syncing | Partition is being rebuilt (recovery or replacement) |
| Spare | Partition has been attached to the MD array but is not yet syncing. Brief: the reconciler promotes a spare into a replacement as soon as one appears. |
| Replacing | An mdadm --replace swap is in progress: a spare member is taking over from this one |
| Deleting | Partition has timed out or been replaced and is being cleaned up |
| Requested | Replacement partition requested on a new node, not yet created |
| Created | Replacement partition exists on the new node, waiting to be imported |
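The transitions drawn above can be written down as a lookup table. This sketch encodes only the arrows shown in the diagram; the real controller may permit others (for example, a member failing again while Syncing), and the `ALLOWED` table and `transition` function are illustrative names, not part of MeshStor.

```python
# Each state maps to the set of states the diagram allows it to move to.
ALLOWED = {
    "Requested": {"Created"},
    "Created": {"Spare"},
    "Spare": {"Syncing"},
    "Syncing": {"Synced"},
    "Synced": {"Missing", "Faulty", "Replacing"},
    "Missing": {"Syncing", "Deleting"},   # recovered, or timed out
    "Faulty": {"Syncing", "Deleting"},    # handled like Missing
    "Replacing": {"Deleting"},
    "Deleting": set(),                    # terminal: the partition is removed
}


def transition(state, new_state):
    """Return new_state if the diagram allows the move, else raise ValueError."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"transition {state} -> {new_state} is not in the diagram")
    return new_state


# Walk a replacement partition through its lifecycle.
state = "Requested"
for nxt in ("Created", "Spare", "Syncing", "Synced"):
    state = transition(state, nxt)
print(state)
```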
Observing Self-Healing¶
Watch volume state in real time:
NAME              PHASE    MDSTATE     READY  DEGRADED  SYNC   NODE      AGE
pvc-cd1038a7-...  Synced   active      2/2    0                mf-01-02  1h
pvc-cd1038a7-...  Synced   active      1/2    1                mf-01-02  1h
pvc-cd1038a7-...  Syncing  recovering  1/2    1         23.4%  mf-01-02  1h
pvc-cd1038a7-...  Synced   active      2/2    0                mf-01-02  1h
Inspect individual partition states:
kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'
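The jsonpath template above prints one tab-separated line per partition (nodeID, state, updatedAt), which is easy to post-process. The helper below is a hypothetical sketch for scripting around that output; the sample node names and timestamps are made up for illustration.

```python
def parse_partitions(output):
    """Parse tab-separated 'nodeID<TAB>state<TAB>updatedAt' lines into dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        node_id, state, updated_at = line.split("\t")
        rows.append({"nodeID": node_id, "state": state, "updatedAt": updated_at})
    return rows


def not_synced(rows):
    """Return the partitions that are in any state other than Synced."""
    return [r for r in rows if r["state"] != "Synced"]


# Made-up sample in the shape the jsonpath template produces.
sample = (
    "node-a\tSynced\t2024-01-01T10:00:00Z\n"
    "node-b\tMissing\t2024-01-01T10:05:00Z\n"
)
for row in not_synced(parse_partitions(sample)):
    print(row["nodeID"], row["state"])
```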
Tuning¶
memberMissingTimeout¶
Controls how long a partition stays in Missing before replacement begins.
| Value | Trade-off |
|---|---|
| 900s (default, 15 min) | Tolerates long outages without unnecessary rebuilds |
| 120s (2 min) | Faster recovery, but brief network blips trigger full rebuilds |
| 60s (minimum) | Aggressive — use only when fast replacement is critical and network is stable |
Set per volume:
To set it for all new volumes, configure it on the StorageClass; see the memberMissingTimeout parameter reference.
Warning
Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default 15 minutes is a safe choice for most environments.
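A back-of-the-envelope calculation shows why waiting usually wins. The sketch below assumes resync time is simply the bytes to copy divided by a sustained throughput; the volume size, write volume, and throughput figures are hypothetical, and real bitmap resyncs copy whole bitmap chunks rather than exact written bytes.

```python
GIB = 1024 ** 3


def resync_seconds(bytes_to_copy, bytes_per_second):
    """Estimate resync duration for a given amount of data and throughput."""
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return bytes_to_copy / bytes_per_second


# Hypothetical 1 TiB volume, 5 GiB written during the outage, 1 GiB/s resync rate.
full = resync_seconds(1024 * GIB, 1 * GIB)        # replacement: copy everything
incremental = resync_seconds(5 * GIB, 1 * GIB)    # recovery: copy only dirty data
print(full, incremental)  # 1024.0 5.0
```

Under these assumptions a replacement takes about 17 minutes of sustained copying, while a bitmap resync after the same outage finishes in seconds.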
replicaCount¶
Higher replica counts provide more resilience:
| replicaCount | Survives | Recovery behavior |
|---|---|---|
| 1 | Relocations only (no redundancy) | Volume is unavailable during the outage and recovers when the node holding the partition returns. |
| 2 | 1 node failure | Reads and writes continue from the remaining replica, then recovery or replacement |
| 3 | 2 simultaneous node failures | Reads and writes continue from the remaining replicas, then recovery or replacement |
What's Next¶
- Replication — how MD RAID replication works
- Volume Relocation — how volumes migrate during node drain
- Volume Expansion — grow a volume online without downtime
- Monitoring — proactive health tracking