
Replication

MeshStor replicates data using Linux MD RAID across multiple nodes. Each volume partition lives on a different node's NVMe drive, and MD assembles them into a single redundant block device. This is the same replication technology used by Linux servers for decades — no custom replication protocol.

How It Works

flowchart TB
    subgraph "Node A (volume owner)"
        PA["Partition A\n(local, nvme0n1p3)"]
        MD["MD RAID1\n/dev/md0"]
        XFS["XFS Filesystem"]
        POD["Pod"]
        PA --> MD
        MD --> XFS
        XFS --> POD
    end

    subgraph "Node B"
        PB["Partition B\n(local, nvme0n1p5)"]
    end

    PB -- "NVMe-oF TCP/RDMA" --> MD

  1. The controller creates a MeshStorVolume CR and selects nodes for partition placement
  2. Each selected node creates a GPT partition on a local NVMe drive
  3. Remote nodes export their partitions via NVMe-oF
  4. The volume owner node imports remote partitions and assembles an MD RAID array
  5. The array is formatted with XFS and mounted to the pod
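
Under the hood, the steps above map onto standard Linux tooling. A minimal sketch of the equivalent manual commands on the volume owner node; the IP address, NQN, device paths, and mount point are illustrative, not MeshStor's actual values:

```shell
# On node A: import node B's exported partition over NVMe-oF TCP
# (node B's export setup via the kernel nvmet target is omitted here)
nvme connect -t tcp -a 10.0.0.2 -s 4420 -n nqn.2024-01.io.meshstor:node-b-p5

# Assemble a 2-way mirror from the local partition and the remote one,
# which appears locally as a new NVMe namespace
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/nvme0n1p3 /dev/nvme1n1

# Format and mount for the pod
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/meshstor/volumes/my-volume
```

MeshStor's node agent performs these operations itself; the sketch only shows which standard tools are involved at each step.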

Configuration

Replication is configured through StorageClass parameters:

Parameter              Default  Description
numberOfCopies         2        Number of data replicas. Each copy lives on a different node. Min: 1.
drivesPerCopy          1        Number of drives per copy. Values >1 create RAID10. Min: 1.
memberMissingTimeout   900      Seconds before a missing member is marked faulty and replaced. Min: 60.
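
For example, a StorageClass requesting three-way replication could look like the following. The provisioner name is illustrative, and note that Kubernetes requires StorageClass parameter values to be strings:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-3copy
provisioner: meshstor.io          # hypothetical provisioner name
parameters:
  numberOfCopies: "3"             # three replicas on three different nodes
  drivesPerCopy: "1"
  memberMissingTimeout: "600"     # replace a missing member after 10 minutes
```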

RAID Levels

The combination of numberOfCopies and drivesPerCopy determines the effective RAID level:

numberOfCopies  drivesPerCopy  Effective RAID                             Total Partitions  Fault Tolerance
1               1              RAID1 (2 slots, 1 active + 1 placeholder)  2                 Can relocate to another node
2               1              RAID1                                      2                 Survives 1 node failure
3               1              RAID1                                      3                 Survives 2 node failures
2               2              RAID10                                     4                 Survives 1 node failure per copy

Single-copy mode

With numberOfCopies=1, MeshStor still creates a 2-slot RAID1 array. One slot holds the active partition, the other is a placeholder ("missing"). This allows the volume to be relocated to a different node without data loss — the placeholder slot receives a new partition on the target node, syncs, and the original partition is removed.
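
This is mdadm's own convention for creating a degraded array: the literal keyword missing reserves a slot with no backing device. A sketch with illustrative device paths:

```shell
# Create a 2-slot RAID1 with one real member and one placeholder slot
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 missing

# During relocation, a partition on the target node (imported over
# NVMe-oF) is added into the empty slot and MD syncs onto it
mdadm --manage /dev/md0 --add /dev/nvme1n1
```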

Degraded Operation

When a member partition becomes unreachable (node failure, network issue, drive error), the MD array enters degraded mode:

  • I/O continues — reads and writes proceed using the remaining active members
  • Volume status reflects the degradation:

kubectl get msvol my-volume -o wide
NAME        PHASE    MDSTATE      TOTAL   ACTIVE   FAILED   DOWN   SYNC
my-volume   Synced   degraded     2       1        0        1

Automatic Recovery

If the missing member comes back online (e.g., node reboots), the reconciliation loop detects it and triggers a rebuild. The syncPercentage field tracks rebuild progress:

kubectl get msvol my-volume -w
NAME        PHASE     MDSTATE       SYNC
my-volume   Syncing   recovering    45.2%
my-volume   Syncing   recovering    78.9%
my-volume   Synced    active
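
The same progress figure appears in /proc/mdstat on the volume owner node. A sketch that extracts the recovery percentage from a sample snippet (the array name, devices, and numbers are illustrative):

```shell
# A /proc/mdstat excerpt during a rebuild (values are illustrative)
mdstat='md0 : active raid1 nvme1n1[2] nvme0n1p3[0]
      104790016 blocks super 1.2 [2/1] [U_]
      [========>............]  recovery = 45.2% (47370240/104790016) finish=4.7min'

# Pull out the recovery percentage
echo "$mdstat" | grep -o 'recovery = [0-9.]*%' | awk '{print $3}'
# → 45.2%
```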

Member Replacement

If a member stays missing for longer than memberMissingTimeout (default: 15 minutes), MeshStor automatically replaces it:

  1. The missing partition is marked Faulty
  2. A replacement node is selected using the same scoring algorithm
  3. A new partition is created on the replacement node
  4. The new partition is added to the MD array and rebuilds
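
At the MD layer, steps 1 through 4 correspond to standard mdadm member management. A sketch with illustrative device paths:

```shell
# Mark the unreachable member faulty and drop it from the array
mdadm --manage /dev/md0 --fail /dev/nvme1n1 --remove /dev/nvme1n1

# Add the freshly created partition on the replacement node
# (imported over NVMe-oF); MD starts rebuilding onto it automatically
mdadm --manage /dev/md0 --add /dev/nvme2n1
```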

Volume Relocation

When the node hosting the MD device (the consumer) is drained with kubectl drain, MeshStor automatically migrates the volume to the node where the workload is rescheduled. The old partition is imported via NVMe-oF, a new local partition is created, and MD syncs the data. Once synced, the old partition is removed. Draining a provider-only node has no immediate effect — the DaemonSet continues exporting partitions normally.

See Volume Relocation for detailed scenarios, observability commands, and troubleshooting.

What's Next