Replication¶
MeshStor replicates data using Linux MD RAID across multiple nodes. Each volume partition lives on a different node's NVMe drive, and MD assembles them into a single redundant block device. This is the same replication technology used by Linux servers for decades — no custom replication protocol.
How It Works¶
```mermaid
flowchart TB
    subgraph "Node A (volume owner)"
        PA["Partition A\n(local, nvme0n1p3)"]
        MD["MD RAID1\n/dev/md0"]
        XFS["XFS Filesystem"]
        POD["Pod"]
        PA --> MD
        MD --> XFS
        XFS --> POD
    end
    subgraph "Node B"
        PB["Partition B\n(local, nvme0n1p5)"]
    end
    PB -- "NVMe-oF TCP/RDMA" --> MD
```
- The controller creates a MeshStorVolume CR and selects nodes for partition placement
- Each selected node creates a GPT partition on a local NVMe drive
- Remote nodes export their partitions via NVMe-oF
- The volume owner node imports remote partitions and assembles an MD RAID array
- The array is formatted with XFS and mounted to the pod
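The steps above can be sketched with standard nvme-cli and mdadm commands. This is an illustration of the underlying mechanism, not MeshStor's actual automation; the IP address, NQN, and device paths are made up:

```shell
# On Node A (volume owner): import the partition that Node B exports via NVMe-oF
nvme connect -t tcp -a 10.0.0.2 -s 4420 -n nqn.2024-01.example:partition-b
# The remote partition now appears as a local block device, e.g. /dev/nvme1n1

# Assemble a RAID1 mirror from the local partition and the imported one
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/nvme0n1p3 /dev/nvme1n1

# Format with XFS and mount for the pod
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/my-volume
```

Because MD sees the imported partition as an ordinary block device, replication is just a mirror write: every block written to /dev/md0 lands on both nodes.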
Configuration¶
Replication is configured through StorageClass parameters:
| Parameter | Default | Description |
|---|---|---|
| `numberOfCopies` | `2` | Number of data replicas. Each copy lives on a different node. Min: 1. |
| `drivesPerCopy` | `1` | Number of drives per copy. Values >1 create RAID10. Min: 1. |
| `memberMissingTimeout` | `900` | Seconds before a missing member is marked faulty and replaced. Min: 60. |
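A StorageClass using these parameters might look like the following. The provisioner name and namespace are illustrative placeholders; only the three parameters come from this page:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-replicated
provisioner: meshstor.example.com   # hypothetical provisioner name
parameters:
  numberOfCopies: "2"          # two replicas on two different nodes
  drivesPerCopy: "1"           # one drive per copy -> RAID1
  memberMissingTimeout: "900"  # wait 15 minutes before auto-replacement
```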
RAID Levels¶
The combination of numberOfCopies and drivesPerCopy determines the effective RAID level:
| numberOfCopies | drivesPerCopy | Effective RAID | Total Partitions | Fault Tolerance |
|---|---|---|---|---|
| 1 | 1 | RAID1 (2 slots, 1 active + 1 placeholder) | 2 | Can relocate to another node |
| 2 | 1 | RAID1 | 2 | Survives 1 node failure |
| 3 | 1 | RAID1 | 3 | Survives 2 node failures |
| 2 | 2 | RAID10 | 4 | Survives 1 node failure per copy |
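The mapping in the table can be expressed as a small helper. This is a sketch of the table's logic, not MeshStor's internal code:

```shell
# Derive the effective RAID level and total partition count
# from numberOfCopies ($1) and drivesPerCopy ($2).
effective_raid() {
  copies=$1
  drives=$2
  if [ "$drives" -gt 1 ]; then
    # Multiple drives per copy: striped mirrors (RAID10)
    echo "raid10 $((copies * drives))"
  elif [ "$copies" -eq 1 ]; then
    # Single copy still gets a 2-slot RAID1; second slot is a placeholder
    echo "raid1 2"
  else
    echo "raid1 $copies"
  fi
}

effective_raid 2 1   # raid1 2
effective_raid 3 1   # raid1 3
effective_raid 2 2   # raid10 4
```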
**Single-copy mode**
With numberOfCopies=1, MeshStor still creates a 2-slot RAID1 array. One slot holds the active partition, the other is a placeholder ("missing"). This allows the volume to be relocated to a different node without data loss — the placeholder slot receives a new partition on the target node, syncs, and the original partition is removed.
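MD supports this natively: passing the literal word `missing` in place of a device creates an array with an empty slot. A sketch with an illustrative device path:

```shell
# 2-slot RAID1 with one real partition and one placeholder slot;
# relocation later fills the "missing" slot with a partition on the target node
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 missing
```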
Degraded Operation¶
When a member partition becomes unreachable (node failure, network issue, drive error), the MD array enters degraded mode:
- I/O continues — reads and writes proceed using the remaining active members
- Volume status is updated to reflect the degradation
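Degraded mode is visible with the standard MD tooling on the owner node (illustrative device name):

```shell
# A personality line like "[2/1] [U_]" means a 2-slot array
# with only one active member
cat /proc/mdstat

# Detailed view: state reads "clean, degraded" and the missing
# member shows as removed or faulty
mdadm --detail /dev/md0
```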
Automatic Recovery¶
If the missing member comes back online (e.g., node reboots), the reconciliation loop detects it and triggers a rebuild. The syncPercentage field tracks rebuild progress:
```
NAME        PHASE     MDSTATE     SYNC
my-volume   Syncing   recovering  45.2%
my-volume   Syncing   recovering  78.9%
my-volume   Synced    active
```
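The SYNC column mirrors the rebuild progress that MD reports in /proc/mdstat. A sketch of extracting it from a sample recovery line (the sample text is illustrative):

```shell
# Sample /proc/mdstat line during a rebuild
line='[=========>...........]  recovery = 45.2% (4718592/10485760) finish=1.2min speed=102400K/sec'

# Pull out the percentage, as surfaced in the SYNC column
sync_pct=$(printf '%s\n' "$line" | grep -oE '[0-9.]+%' | head -n 1)
echo "$sync_pct"   # prints 45.2%
```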
Member Replacement¶
If a member stays missing for longer than memberMissingTimeout (default: 15 minutes), MeshStor automatically replaces it:
- The missing partition is marked Faulty
- A replacement node is selected using the same scoring algorithm
- A new partition is created on the replacement node
- The new partition is added to the MD array and rebuilds
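At the MD layer, the replacement reduces to the standard fail/remove/add sequence (device paths are illustrative; MeshStor performs these steps automatically):

```shell
# Fail and remove the unreachable member
mdadm /dev/md0 --fail /dev/nvme1n1 --remove /dev/nvme1n1

# Add the freshly created partition from the replacement node
# (imported via NVMe-oF); MD starts rebuilding onto it immediately
mdadm /dev/md0 --add /dev/nvme2n1
```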
Volume Relocation¶
When a node hosting the MD device (consumer) is drained with kubectl drain, MeshStor automatically migrates the volume to another node. On the new node, the existing partition is imported via NVMe-oF, a new local partition is created, and MD syncs the data onto it. Once synced, the old partition is removed. Draining a provider-only node has no immediate effect — the DaemonSet continues exporting partitions normally.
See Volume Relocation for detailed scenarios, observability commands, and troubleshooting.
What's Next¶
- Self-Healing — automatic recovery and replacement on failures
- Volume Relocation — how volumes migrate during node drain
- StorageClass Examples — ready-to-use configurations
- StorageClass Parameters — complete parameter reference