
Self-Healing

MeshStor automatically recovers from network failures, node outages, and drive loss without administrator intervention. With numberOfCopies>=2, volumes continue serving reads and writes from the remaining copies while recovery happens in the background.

How It Works

A reconciliation loop, called the Probe, runs on every node every 10 seconds. Each cycle:

  1. Detects faulty or missing MD RAID members
  2. Attempts to reconnect lost NVMe-oF connections
  3. Re-adds recovered partitions to the MD array
  4. Requests replacement partitions for members that have been missing too long
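The cycle above can be sketched in Python. This is an illustrative model, not MeshStor's actual API: the names `Member`, `reconcile`, and the callback parameters are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

MEMBER_MISSING_TIMEOUT = 900  # seconds (the memberMissingTimeout default)

@dataclass
class Member:
    node: str
    connected: bool = True
    state: str = "Synced"              # Synced | Missing | Syncing | Deleting
    missing_since: Optional[float] = None

def reconcile(members, now, reconnect, readd, request_replacement):
    """One Probe cycle: detect, reconnect, re-add, or replace members."""
    for m in members:
        # 1. Detect faulty or missing members
        if not m.connected and m.state == "Synced":
            m.state, m.missing_since = "Missing", now
        if m.state != "Missing":
            continue
        # 2./3. Try to reconnect NVMe-oF and re-add the partition to MD
        if reconnect(m):
            readd(m)                   # conceptually: mdadm --re-add (bitmap resync)
            m.state, m.missing_since = "Syncing", None
        # 4. Request a replacement for members missing too long
        elif now - m.missing_since >= MEMBER_MISSING_TIMEOUT:
            m.state = "Deleting"
            request_replacement(m)
```

Running one cycle per tick against this model reproduces the timeline described in the next section: a member moves to Missing on detection, back through Syncing on reconnect, or to Deleting once the timeout expires.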

Network Failure Timeline

When a remote node becomes unreachable:

| Time | What happens |
| --- | --- |
| 0s | Network connection drops |
| 1s | NVMe keep-alive probe fails and I/O errors begin (fast_io_fail_tmo=1s). Fast detection lets MD RAID switch to serving I/O from the remaining copies instead of stalling on an unresponsive device. |
| 3s | NVMe controller removed from the kernel (ctrl_loss_tmo=3s); the block device disappears. The short timeout prevents the kernel from holding onto an unresponsive controller. |
| ~10s | Next Probe cycle. MD detects the missing member and removes it from the array. With numberOfCopies>=2, the volume continues in degraded mode: reads and writes use the remaining copies. With numberOfCopies=1, MD stops because no copies remain, and the pod is evicted so Kubernetes can reschedule it to a node where the data becomes available again. |
| every ~10s | The reconciler attempts to reconnect NVMe-oF to the lost node. |
| on recovery | NVMe reconnects and the partition is re-added to MD, which uses the write-intent bitmap for a fast incremental resync (only blocks written during the outage). |
| 15 min (default) | If still missing: the partition is marked for deletion and a replacement is requested on a healthy node. The timeout avoids replacing partitions that would have recovered on their own; a replacement triggers a full resync, which is more expensive than waiting. Configurable via memberMissingTimeout. |
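Composing these timeouts gives a back-of-the-envelope worst case. This is my own arithmetic under the stated defaults, assuming the failure lands just after a Probe tick; exact timing depends on where in the interval the failure occurs.

```python
FAST_IO_FAIL_TMO = 1          # s: I/O errors begin
CTRL_LOSS_TMO = 3             # s: controller removed, block device disappears
PROBE_INTERVAL = 10           # s: reconciler (Probe) period
MEMBER_MISSING_TIMEOUT = 900  # s: default memberMissingTimeout

# Worst case: the device disappears just after a Probe tick, so MD
# detection waits almost a full interval on top of ctrl_loss_tmo.
worst_case_md_detach = CTRL_LOSS_TMO + PROBE_INTERVAL

# Replacement starts at the first Probe tick at or after the timeout expires.
worst_case_replacement_start = (
    worst_case_md_detach + MEMBER_MISSING_TIMEOUT + PROBE_INTERVAL
)

print(worst_case_md_detach, worst_case_replacement_start)  # 13 923
```

In other words, a member is detached within roughly 13 seconds of a failure, and a replacement begins at most about 923 seconds after it, under default settings.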

Note

With numberOfCopies>=2, the volume stays readable and writable throughout this process; applications see no interruption as long as at least one copy remains active.

Warning

With numberOfCopies=1, any member loss makes the volume unavailable. The MD array is stopped and the pod is evicted so that Kubernetes can reschedule it. Network-related unavailability only applies when the current node lacks enough local storage to migrate the partition from the remote node; when the partition is local, network failures do not affect the volume.

Recovery vs. Replacement

MeshStor distinguishes between two scenarios:

Automatic Recovery (node comes back before timeout)

sequenceDiagram
    participant R as Reconciler
    participant MD as MD RAID
    participant N as NVMe-oF

    R->>N: Attempt reconnect (every Probe)
    N-->>R: Connection restored
    R->>MD: mdadm --re-add (bitmap resync)
    MD-->>R: Syncing... Synced

  • With numberOfCopies>=2: the original partition is re-added to the running array (mdadm --re-add). MD uses its write-intent bitmap for incremental resync: only blocks written during the outage are rebuilt. Resync is fast, taking seconds to minutes depending on write volume during the outage.
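The write-intent bitmap mechanism can be illustrated with a toy model. The chunk size and function names here are illustrative assumptions; real MD bitmap chunk sizes are configurable and tracked per array.

```python
CHUNK_MB = 64  # illustrative bitmap chunk size (assumption, not MeshStor's value)

def dirty_chunks(write_offsets_mb, chunk_mb=CHUNK_MB):
    """Map write offsets (in MB) to the set of bitmap chunks they dirty."""
    return {offset // chunk_mb for offset in write_offsets_mb}

def resync_cost_mb(write_offsets_mb, volume_mb, chunk_mb=CHUNK_MB):
    """Incremental resync copies only dirty chunks; a brand-new member
    with no bitmap history would need a full resync of volume_mb."""
    return min(len(dirty_chunks(write_offsets_mb, chunk_mb)) * chunk_mb, volume_mb)

# Writes clustered in three chunks during the outage:
writes = [10, 20, 70, 130, 135]                # MB offsets written while degraded
print(resync_cost_mb(writes, volume_mb=1024))  # 192, vs. a 1024 MB full resync
```

This is why recovery is so much cheaper than replacement: resync cost scales with how much was written during the outage, not with volume size.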

Automatic Replacement (node stays down past timeout)

sequenceDiagram
    participant R as Reconciler
    participant K as Kubernetes API
    participant New as New Node

    R->>K: Transition Missing → Deleting
    R->>K: Select best available node, request new partition
    New->>New: Create partition, export via NVMe-oF
    R->>R: Import new partition, mdadm --add
    R->>R: Full resync from remaining members
    R->>K: Transition Syncing → Synced

  • A new node is selected based on RDMA connectivity, free space, and latency
  • Nodes that already hold a partition for this volume are excluded; this maintains fault isolation so a single node failure cannot take out multiple copies
  • Full resync is required (no bitmap; this is a brand-new partition with no prior data)
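The selection policy reads as a filter-then-rank step. This sketch uses illustrative field names and an assumed ranking order (latency first, then free space); the actual scoring is internal to MeshStor.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    rdma_reachable: bool
    free_gb: int
    latency_us: float

def select_replacement(nodes, holders, needed_gb) -> Optional[Node]:
    """Pick a replacement node: RDMA-reachable, enough free space, and
    not already holding a copy of this volume (fault isolation).
    Among candidates, prefer low latency, then more free space."""
    candidates = [
        n for n in nodes
        if n.rdma_reachable and n.free_gb >= needed_gb and n.name not in holders
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.latency_us, -n.free_gb))
```

If no candidate passes the filters, no replacement is placed and the volume stays degraded until capacity or connectivity returns.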

Partition State Transitions

Synced → Missing → (recovered) → Syncing → Synced
                 → (timed out) → Deleting → removed
                                 (replacement) → Requested → Created → Syncing → Synced
| State | Meaning |
| --- | --- |
| Synced | Partition is healthy and in sync |
| Missing | Partition was reachable but is not anymore. Recovery is attempted each Probe cycle. |
| Syncing | Partition is being rebuilt (recovery or replacement) |
| Deleting | Partition timed out and is being cleaned up |
| Requested | Replacement partition requested on a new node, not yet created |
| Created | Replacement partition exists on the new node, waiting to be imported |
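The diagram above can be encoded as an allowed-transition table. This is a reading of the documented lifecycle, not MeshStor's internal representation.

```python
# Allowed state transitions, per the lifecycle documented above.
ALLOWED = {
    "Synced":    {"Missing"},
    "Missing":   {"Syncing", "Deleting"},  # recovered vs. timed out
    "Syncing":   {"Synced"},
    "Deleting":  set(),                    # partition is removed afterwards
    "Requested": {"Created"},              # replacement lifecycle
    "Created":   {"Syncing"},
}

def can_transition(src: str, dst: str) -> bool:
    return dst in ALLOWED.get(src, set())
```

Such a table makes illegal jumps (for example Synced straight to Deleting) easy to reject in a reconciler.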

Observing Self-Healing

Watch volume state in real time:

kubectl get msvol -w
NAME        PHASE     MDSTATE      TOTAL   ACTIVE   FAILED   DOWN   SYNC
my-volume   Synced    active       2       2        0        0
my-volume   Synced    degraded     2       1        0        1
my-volume   Syncing   recovering   2       1        0        1      23.4%
my-volume   Synced    active       2       2        0        0

Inspect individual partition states:

kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\t"}{.updatedAt}{"\n"}{end}'

Tuning

memberMissingTimeout

Controls how long a partition stays in Missing before replacement begins.

| Value | Trade-off |
| --- | --- |
| 900s (default, 15 min) | Tolerates long outages without unnecessary rebuilds |
| 120s (2 min) | Faster recovery, but brief network blips trigger full rebuilds |
| 60s (minimum) | Aggressive; use only when fast replacement is critical and the network is stable |

Set per volume:

kubectl patch msvol my-volume --type=merge \
  -p '{"spec":{"memberMissingTimeout":120}}'

Set for all new volumes via StorageClass:

parameters:
  memberMissingTimeout: "120"

Warning

Lower timeouts increase the chance of unnecessary replacements. Recovery (bitmap resync) is much cheaper than replacement (full resync), so it is better to wait for a node to come back than to replace it prematurely. The default of 15 minutes is a safe choice for most environments.

numberOfCopies

Higher copy counts provide more resilience:

| numberOfCopies | Survives | Recovery behavior |
| --- | --- | --- |
| 1 | Relocations only (no redundancy) | Volume unavailable during outage. Recovers when the node with the storage returns. |
| 2 | 1 node failure | Reads and writes continue from the remaining copy, then recovery or replacement |
| 3 | 2 simultaneous node failures | Reads and writes continue from the remaining copies, then recovery or replacement |
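The rule underlying this table is simple: since copies are placed on distinct nodes, a volume with n copies tolerates n−1 simultaneous node failures.

```python
def survivable_failures(number_of_copies: int) -> int:
    """Copies live on distinct nodes, so the volume stays available
    while at least one copy remains: n copies survive n-1 failures."""
    return max(number_of_copies - 1, 0)

for copies in (1, 2, 3):
    print(copies, "copies ->", survivable_failures(copies), "node failure(s)")
```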

What's Next