
Replication

MeshStor replicates data using Linux MD RAID across multiple nodes. Each volume partition lives on a different node's NVMe drive, and MD assembles them into a single redundant block device. This is the same replication technology used by Linux servers for decades — no custom replication protocol.

How It Works

flowchart TB
    subgraph "Node A (volume owner)"
        PA["Partition A\n(local, nvme0n1p3)"]
        MD["MD RAID1\n/dev/md0"]
        XFS["XFS Filesystem"]
        POD["Pod"]
        PA --> MD
        MD --> XFS
        XFS --> POD
    end

    subgraph "Node B"
        PB["Partition B\n(local, nvme0n1p5)"]
    end

    PB -- "NVMe-oF TCP/RDMA" --> MD

  1. The controller creates a MeshStorVolume CR and selects nodes for partition placement
  2. Each selected node creates a GPT partition on a local NVMe drive
  3. Remote nodes export their partitions via NVMe-oF
  4. The volume owner node imports remote partitions and assembles an MD RAID array
  5. The array is formatted with XFS and mounted to the pod
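
Under the hood, the steps above map onto standard Linux tooling. A minimal sketch of the equivalent manual commands on the volume owner node; the IP address, NQN, device paths, and mount point are illustrative, not MeshStor's actual values:

```shell
# On node A: import node B's exported partition over NVMe-oF TCP
# (node B's export setup via the kernel nvmet target is omitted here)
nvme connect -t tcp -a 10.0.0.2 -s 4420 -n nqn.2024-01.io.meshstor:node-b-p5

# Assemble a 2-way mirror from the local partition and the remote one,
# which appears locally as a new NVMe namespace
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/nvme0n1p3 /dev/nvme1n1

# Format and mount for the pod
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/meshstor/volumes/my-volume
```

MeshStor's node agent performs these operations itself; the sketch only shows which standard tools are involved at each step.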

Configuration

Replication is configured through StorageClass parameters:

Parameter              Default  Description
numberOfCopies         2        Number of data replicas. Each copy lives on a different node. Min: 1.
drivesPerCopy          1        Number of drives per copy. Values >1 create RAID10. Min: 1.
memberMissingTimeout   900      Seconds before a missing member is marked faulty and replaced. Min: 60.
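
For example, a StorageClass requesting three-way replication could look like the following. The provisioner name is illustrative, and note that Kubernetes requires StorageClass parameter values to be strings:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-3copy
provisioner: meshstor.io          # hypothetical provisioner name
parameters:
  numberOfCopies: "3"             # three replicas on three different nodes
  drivesPerCopy: "1"
  memberMissingTimeout: "600"     # replace a missing member after 10 minutes
```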

RAID Levels

The combination of numberOfCopies and drivesPerCopy determines the effective RAID level:

numberOfCopies  drivesPerCopy  Effective RAID                             Total Partitions  Fault Tolerance
1               1              RAID1 (2 slots, 1 active + 1 placeholder)  2                 Can relocate to another node
2               1              RAID1                                      2                 Survives 1 node failure
3               1              RAID1                                      3                 Survives 2 node failures
2               2              RAID10                                     4                 Survives 1 node failure per copy

Single-copy mode

With numberOfCopies=1, MeshStor still creates a 2-slot RAID1 array. One slot holds the active partition, the other is a placeholder ("missing"). This allows the volume to be relocated to a different node without data loss — the placeholder slot receives a new partition on the target node, syncs, and the original partition is removed.
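
This is mdadm's own convention for creating a degraded array: the literal keyword missing reserves a slot with no backing device. A sketch with illustrative device paths:

```shell
# Create a 2-slot RAID1 with one real member and one placeholder slot
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 missing

# During relocation, a partition on the target node (imported over
# NVMe-oF) is added into the empty slot and MD syncs onto it
mdadm --manage /dev/md0 --add /dev/nvme1n1
```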

Degraded Operation

When a member partition becomes unreachable (node failure, network issue, drive error), the MD array enters degraded mode:

  • I/O continues — reads and writes proceed using the remaining active members
  • Volume status reflects the degradation:

kubectl get msvol my-volume -o wide
NAME        PHASE    MDSTATE      TOTAL   ACTIVE   FAILED   DOWN   SYNC
my-volume   Synced   degraded     2       1        0        1

Automatic Recovery

If the missing member comes back online (e.g., node reboots), the reconciliation loop detects it and triggers a rebuild. The syncPercentage field tracks rebuild progress:

kubectl get msvol my-volume -w
NAME        PHASE     MDSTATE       SYNC
my-volume   Syncing   recovering    45.2%
my-volume   Syncing   recovering    78.9%
my-volume   Synced    active
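
The same progress figure appears in /proc/mdstat on the volume owner node. A sketch that extracts the recovery percentage from a sample snippet (the array name, devices, and numbers are illustrative):

```shell
# A /proc/mdstat excerpt during a rebuild (values are illustrative)
mdstat='md0 : active raid1 nvme1n1[2] nvme0n1p3[0]
      104790016 blocks super 1.2 [2/1] [U_]
      [========>............]  recovery = 45.2% (47370240/104790016) finish=4.7min'

# Pull out the recovery percentage
echo "$mdstat" | grep -o 'recovery = [0-9.]*%' | awk '{print $3}'
# → 45.2%
```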

Member Replacement

If a member stays missing for longer than memberMissingTimeout (default: 15 minutes), MeshStor automatically replaces it:

  1. The missing partition is marked Faulty
  2. A replacement node is selected using the same scoring algorithm
  3. A new partition is created on the replacement node
  4. The new partition is added to the MD array and rebuilds
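
At the MD layer, steps 1 through 4 correspond to standard mdadm member management. A sketch with illustrative device paths:

```shell
# Mark the unreachable member faulty and drop it from the array
mdadm --manage /dev/md0 --fail /dev/nvme1n1 --remove /dev/nvme1n1

# Add the freshly created partition on the replacement node
# (imported over NVMe-oF); MD starts rebuilding onto it automatically
mdadm --manage /dev/md0 --add /dev/nvme2n1
```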

Volume Relocation

When the node hosting the MD device (the consumer) is drained with kubectl drain, MeshStor automatically migrates the volume to the node where the workload is rescheduled. The old partition is imported via NVMe-oF, a new local partition is created, and MD syncs the data. Once synced, the old partition is removed. Draining a provider-only node has no immediate effect — the DaemonSet continues exporting partitions normally.

See Volume Relocation for detailed scenarios, observability commands, and troubleshooting.

What's Next