Self-Healing¶
MeshStor automatically recovers from network failures, node outages, and drive loss. No administrator intervention is required. With numberOfCopies>=2, volumes continue serving reads and writes from the remaining copies while recovery happens in the background.
How It Works¶
A reconciliation loop runs on every node every 10 seconds. Each cycle:
- Detects faulty or missing MD RAID members
- Attempts to reconnect lost NVMe-oF connections
- Re-adds recovered partitions to the MD array
- Requests replacement partitions for members that have been missing too long
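The cycle above boils down to one decision per array member. A minimal sketch of that decision, assuming hypothetical helper names and state fields (this is a reasoning aid, not MeshStor's actual API):

```python
# Hypothetical sketch of one reconciliation (Probe) cycle.
# Function names, state strings, and field names are illustrative.

PROBE_INTERVAL_S = 10  # each node runs a cycle roughly every 10 seconds

def probe_cycle(array, now, member_missing_timeout_s=900):
    """Return the actions one cycle would take for an MD array's members."""
    actions = []
    for member in array["members"]:
        if member["state"] == "missing":
            elapsed = now - member["missing_since"]
            if elapsed >= member_missing_timeout_s:
                # Missing too long: request a replacement on a healthy node
                actions.append(("request_replacement", member["id"]))
            else:
                # Still within the timeout: retry the NVMe-oF connection
                actions.append(("reconnect_nvmeof", member["id"]))
        elif member["state"] == "reconnected":
            # Device is back: re-add it so MD can do a bitmap resync
            actions.append(("mdadm_re_add", member["id"]))
    return actions
```

The same loop handles both recovery (reconnect, re-add) and replacement (timeout, request new partition); which path a member takes depends only on how long it has been missing.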
Network Failure Timeline¶
When a remote node becomes unreachable:
| Time | What Happens |
|---|---|
| 0s | Network connection drops |
| 1s | NVMe keep-alive probe fails, I/O errors begin (fast_io_fail_tmo=1s). Fast detection lets MD RAID switch to serving I/O from the remaining copies instead of stalling on an unresponsive device. |
| 3s | NVMe controller removed from kernel (ctrl_loss_tmo=3s), block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller. |
| ~10s | Next Probe cycle. MD detects the missing member and removes it from the array. With numberOfCopies>=2, the volume continues in degraded mode — reads and writes use the remaining copies. With numberOfCopies=1, MD stops because no copies remain, and the pod is evicted so Kubernetes can reschedule it to a node where the data becomes available again. |
| every ~10s | Reconciler attempts to reconnect NVMe-oF to the lost node. |
| on recovery | NVMe reconnects, partition re-added to MD. MD uses the write-intent bitmap for fast incremental resync (only blocks written during the outage). |
| 15 min (default) | If still missing: partition marked for deletion, replacement requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own — a replacement triggers a full resync which is more expensive than waiting. Configurable via memberMissingTimeout. |
Note
With numberOfCopies>=2, the volume stays readable and writable throughout this process — applications see no interruption as long as at least one copy remains active.
Warning
With numberOfCopies=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network-related unavailability applies only when the current node lacks enough local storage to migrate the partition from the remote node; when the partition is local, network failures do not affect the volume.
Recovery vs. Replacement¶
MeshStor distinguishes between two scenarios:
Automatic Recovery (node comes back before timeout)¶
sequenceDiagram
participant R as Reconciler
participant MD as MD RAID
participant N as NVMe-oF
R->>N: Attempt reconnect (every Probe)
N-->>R: Connection restored
R->>MD: mdadm --re-add (bitmap resync)
MD-->>R: Syncing... Synced
- With numberOfCopies>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync — only blocks written during the outage are rebuilt. Resync is fast (seconds to minutes, depending on write volume during the outage).
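The cost gap between recovery and replacement is easy to see with rough numbers. The rebuild rate and sizes below are assumptions for illustration, not MeshStor measurements:

```python
# Illustrative comparison of bitmap (incremental) resync vs full resync.
# The 200 MiB/s rebuild rate and the data sizes are assumed for the example.

def resync_seconds(bytes_to_rebuild, rebuild_mib_per_s=200):
    """Time to rebuild a given amount of data at a fixed rebuild rate."""
    return bytes_to_rebuild / (rebuild_mib_per_s * 1024 * 1024)

GiB = 1024 ** 3

# Recovery: the write-intent bitmap limits the rebuild to blocks written
# during the outage -- say 2 GiB of writes.
incremental = resync_seconds(2 * GiB)    # ~10 seconds

# Replacement: a brand-new partition has no bitmap, so the entire
# 500 GiB copy must be rebuilt.
full = resync_seconds(500 * GiB)         # ~43 minutes
```

This asymmetry is why the default memberMissingTimeout favors waiting: a node that returns within 15 minutes costs seconds of resync, while replacing it costs a full rebuild.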
Automatic Replacement (node stays down past timeout)¶
sequenceDiagram
participant R as Reconciler
participant K as Kubernetes API
participant New as New Node
R->>K: Transition Missing → Deleting
R->>K: Select best available node, request new partition
New->>New: Create partition, export via NVMe-oF
R->>R: Import new partition, mdadm --add
R->>R: Full resync from remaining members
R->>K: Transition Syncing → Synced
- A new node is selected based on RDMA connectivity, free space, and latency
- Nodes that already hold a partition for this volume are excluded — this maintains fault isolation so a single node failure cannot take out multiple copies
- Full resync is required (no bitmap — this is a brand-new partition with no prior data)
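The selection rules above can be sketched as a filter plus a scoring pass. Field names and the tie-breaking order are illustrative assumptions; the text only states the criteria (RDMA connectivity, free space, latency, and fault isolation):

```python
# Hypothetical sketch of replacement-node selection.

def select_replacement_node(nodes, volume_partition_nodes, required_bytes):
    """Pick a node for the replacement partition, or None if none qualifies."""
    candidates = [
        n for n in nodes
        if n["rdma_reachable"]                       # must have RDMA connectivity
        and n["free_bytes"] >= required_bytes        # must fit the partition
        and n["name"] not in volume_partition_nodes  # fault isolation: one copy per node
    ]
    if not candidates:
        return None
    # Prefer low latency, then more free space (the ordering here is an assumption).
    best = min(candidates, key=lambda n: (n["latency_ms"], -n["free_bytes"]))
    return best["name"]
```

Note how the exclusion set does the fault-isolation work: even the lowest-latency node is skipped if it already holds a copy of this volume.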
Partition State Transitions¶
Synced → Missing → (recovered) → Syncing → Synced
Synced → Missing → (timed out) → Deleting → removed
(new) → Requested → Created → Syncing → Synced
| State | Meaning |
|---|---|
| Synced | Partition is healthy and in sync |
| Missing | Partition was reachable but is not anymore. Recovery is attempted each Probe cycle. |
| Syncing | Partition is being rebuilt (recovery or replacement) |
| Deleting | Partition timed out and is being cleaned up |
| Requested | Replacement partition requested on a new node, not yet created |
| Created | Replacement partition exists on the new node, waiting to be imported |
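The lifecycle above can be written down as a transition table and checked mechanically. This is a reasoning aid built from the diagram, not MeshStor's implementation:

```python
# Allowed partition-state transitions, transcribed from the diagram above.

TRANSITIONS = {
    "Synced":    {"Missing"},              # member lost
    "Missing":   {"Syncing", "Deleting"},  # recovered before timeout / timed out
    "Syncing":   {"Synced"},               # resync finished
    "Deleting":  set(),                    # terminal: partition is removed
    "Requested": {"Created"},              # replacement allocated on new node
    "Created":   {"Syncing"},              # imported, resync starts
}

def is_valid(path):
    """True if every consecutive pair in the path is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))
```

For example, Missing can only move forward to Syncing or Deleting; a partition never jumps straight back to Synced without a resync.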
Observing Self-Healing¶
Watch volume state in real time:
NAME PHASE MDSTATE TOTAL ACTIVE FAILED DOWN SYNC
my-volume Synced active 2 2 0 0
my-volume Synced degraded 2 1 0 1
my-volume Syncing recovering 2 1 0 1 23.4%
my-volume Synced active 2 2 0 0
Inspect individual partition states:
kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'
Tuning¶
memberMissingTimeout¶
Controls how long a partition stays in Missing before replacement begins.
| Value | Trade-off |
|---|---|
| 900s (default, 15 min) | Tolerates long outages without unnecessary rebuilds |
| 120s (2 min) | Faster recovery, but brief network blips trigger full rebuilds |
| 60s (minimum) | Aggressive — use only when fast replacement is critical and network is stable |
Set per volume:
Set for all new volumes via StorageClass:
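As a sketch, both settings might look like the following. The API group, CRD kind, provisioner name, and field placement are assumptions for illustration; only the memberMissingTimeout name comes from this page, so check your MeshStor CRD reference for the exact schema:

```yaml
# Per volume (hypothetical CRD kind and field placement)
apiVersion: meshstor.example.com/v1
kind: MeshStorVolume
metadata:
  name: my-volume
spec:
  memberMissingTimeout: 120s
---
# For all new volumes (hypothetical StorageClass parameter)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-fast-replace
provisioner: meshstor.csi.example.com
parameters:
  memberMissingTimeout: "120s"
```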
Warning
Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default 15 minutes is a safe choice for most environments.
numberOfCopies¶
Higher copy counts provide more resilience:
| numberOfCopies | Survives | Recovery behavior |
|---|---|---|
| 1 | Relocations only (no redundancy) | Volume unavailable during outage. Recovers when node with storage returns. |
| 2 | 1 node failure | Reads and writes continue from the remaining copy, then recovery or replacement |
| 3 | 2 simultaneous node failures | Reads and writes continue from the remaining copies, then recovery or replacement |
What's Next¶
- Replication — how MD RAID replication works
- Volume Relocation — how volumes migrate during node drain
- Volume Expansion — grow a volume online without downtime
- Common Issues — symptom-driven troubleshooting
- Monitoring — proactive health tracking