
Replication

MeshStor replicates data using Linux MD RAID across multiple nodes. Each volume partition lives on a different node's NVMe drive, and MD assembles them into a single redundant block device. MD RAID has been used on Linux servers for decades, so MeshStor needs no custom replication protocol.

How It Works

Figure: data path. An application pod writes through the Linux kernel's MD RAID1, which fans each write out to the local NVMe drive and to a remote replica node over NVMe-oF RDMA.

  1. The controller creates a MeshStorVolume CR when a PVC is provisioned.
  2. Kubernetes schedules the pod onto a node — the consumer — and the CSI node plugin there picks replica nodes via the scoring algorithm, recording the requests in the CR.
  3. Each selected node creates a GPT partition on a local NVMe drive.
  4. Remote nodes export their partitions via NVMe-oF.
  5. The consumer imports the remote partitions and assembles an MD RAID array.
  6. The array is formatted with XFS and mounted to the pod.
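
Steps 5 and 6 map onto standard Linux tools. As a minimal sketch, the function below only builds the command lines (it does not run them) for creating a RAID1 array and formatting it with XFS. The function name, the device paths, and the `/dev/md0` name are hypothetical, and the exact flags MeshStor passes are not documented here, so treat this as an illustration of the mechanism rather than MeshStor's actual invocation.

```python
from typing import List


def assemble_commands(md_device: str, members: List[str]) -> List[List[str]]:
    """Build (but do not run) commands to create a RAID1 array and format it.

    `members` holds block-device paths. The literal string "missing" is
    mdadm's keyword for a slot that has no device yet.
    """
    if len(members) < 2:
        raise ValueError("RAID1 needs at least two slots (use 'missing' for an empty one)")
    create = [
        "mdadm", "--create", md_device,
        "--level=1",
        f"--raid-devices={len(members)}",
        *members,
    ]
    mkfs = ["mkfs.xfs", md_device]
    return [create, mkfs]


if __name__ == "__main__":
    # Hypothetical paths: one local partition and one imported over NVMe-oF.
    for cmd in assemble_commands("/dev/md0", ["/dev/nvme0n1p1", "/dev/nvme1n1p1"]):
        print(" ".join(cmd))
```

Passing `"missing"` as one of the members yields an array with an empty slot, which is how a placeholder member can be represented at creation time.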

Configuration

Three StorageClass parameters shape replication:

  • replicaCount — how many copies the volume keeps, each on a different node.
  • stripeWidth — how many local drives a single replica spans. 1 means RAID1; >1 means RAID10 (a stripe of mirrors).
  • memberMissingTimeout — how long an unreachable member is given to recover before MeshStor replaces it.

The combination of replicaCount and stripeWidth selects the effective RAID level. See StorageClass Parameters for the full grid (types, defaults, minimums, storage overhead) and StorageClass Examples for ready-to-use configurations.
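
The level selection described above can be sketched in a few lines. The `Layout` type and `effective_layout` function are hypothetical names, and the partition count assumes one partition per drive a replica spans (replicaCount × stripeWidth); the StorageClass Parameters page remains the authority on the full grid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    raid_level: str  # "raid1" or "raid10"
    partitions: int  # active member partitions (assumption: one per spanned drive)


def effective_layout(replica_count: int, stripe_width: int) -> Layout:
    """Map the two StorageClass parameters to an MD RAID level."""
    if replica_count < 1 or stripe_width < 1:
        raise ValueError("replicaCount and stripeWidth must both be >= 1")
    # stripeWidth=1 means plain mirroring; anything wider is a stripe of mirrors.
    level = "raid1" if stripe_width == 1 else "raid10"
    return Layout(raid_level=level, partitions=replica_count * stripe_width)


if __name__ == "__main__":
    print(effective_layout(2, 1))
    print(effective_layout(2, 2))
```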

Single-replica mode is still relocatable

Even with replicaCount=1, the volume can move to a different node when the pod reschedules. The single-replica array always carries a free placeholder slot (a missing member); on relocation, MeshStor creates a new partition on the target node, fills the placeholder, syncs, then removes the original — no data loss.

Setting          Layout
stripeWidth=1    2-slot RAID1 (1 active member + 1 placeholder), with local drive throughput
stripeWidth>1    RAID10 with placeholder mirror slots, with local RAID0-like throughput
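
For the stripeWidth=1 case, the placeholder mechanism can be modeled as a sequence of slot states. This is a toy model under stated assumptions, not MeshStor code: `relocate`, the `Slots` alias, and the partition labels are hypothetical, and `None` stands in for the missing member.

```python
from typing import List, Optional

Slots = List[Optional[str]]  # None marks the placeholder (missing member)


def relocate(slots: Slots, target: str) -> List[Slots]:
    """Return the slot states a 2-slot single-replica array passes through."""
    if len(slots) != 2 or slots.count(None) != 1:
        raise ValueError("expected a 2-slot array with exactly one placeholder")
    original = next(s for s in slots if s is not None)
    if target == original:
        raise ValueError("target must differ from the current member")
    states = [list(slots)]
    # 1. The new partition on the target node fills the placeholder.
    filled = [s if s is not None else target for s in slots]
    states.append(filled)
    # 2. Once the sync is done, the original is removed; its slot becomes
    #    the new placeholder, so the volume stays relocatable.
    states.append([None if s == original else s for s in filled])
    return states


if __name__ == "__main__":
    for state in relocate(["node-a/part1", None], "node-b/part1"):
        print(state)
```

At every step at least one slot holds a complete copy of the data, which is why the move does not lose data.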

Degraded Operation

When a member partition becomes unreachable (node failure, network issue, drive error), the MD array enters degraded mode:

  • I/O continues — reads and writes proceed using the remaining active members
  • Volume status reflects the degradation:

kubectl get msvol my-volume
NAME               PHASE    MDSTATE    READY   DEGRADED   SYNC   NODE       AGE
pvc-cd1038a7-...   Synced   degraded   1/2     1                 mf-01-02   2h

Automatic Recovery

If the missing member comes back online (e.g., node reboots), the reconciliation loop detects it and triggers a rebuild. The syncPercentage field tracks rebuild progress:

kubectl get msvol my-volume -w
NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Syncing   recovering   1/2     1          45.2%   mf-01-02   2h
pvc-cd1038a7-...   Syncing   recovering   1/2     1          78.9%   mf-01-02   2h
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   2h
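
For scripting around this output, the rows above can be parsed with a small helper. This is a sketch that assumes the eight-column layout shown here, with SYNC left blank when no rebuild is running; `VolumeRow` and `parse_row` are hypothetical names. Parsing human-readable tables is fragile, so structured output from kubectl is usually the better choice when the field names are known.

```python
from typing import NamedTuple, Optional


class VolumeRow(NamedTuple):
    name: str
    phase: str
    md_state: str
    ready: int
    total: int
    degraded: int
    sync_percent: Optional[float]  # None when the SYNC column is blank
    node: str
    age: str


def parse_row(line: str) -> VolumeRow:
    """Parse one data row of the table printed by `kubectl get msvol`."""
    fields = line.split()
    if len(fields) == 8:
        name, phase, md_state, ready, degraded, sync, node, age = fields
        sync_percent: Optional[float] = float(sync.rstrip("%"))
    elif len(fields) == 7:
        # SYNC is empty, so whitespace splitting yields one field fewer.
        name, phase, md_state, ready, degraded, node, age = fields
        sync_percent = None
    else:
        raise ValueError(f"unexpected column count: {len(fields)}")
    ready_n, total_n = (int(part) for part in ready.split("/"))
    return VolumeRow(name, phase, md_state, ready_n, total_n,
                     int(degraded), sync_percent, node, age)


if __name__ == "__main__":
    row = parse_row("pvc-cd1038a7-...   Syncing   recovering   1/2     1          45.2%   mf-01-02   2h")
    print(row.ready, row.total, row.sync_percent)
```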

Member Replacement

If a member stays missing for longer than memberMissingTimeout (default: 15 minutes), MeshStor automatically replaces it:

  1. The missing partition is marked Faulty
  2. A replacement node is selected using the same scoring algorithm
  3. A new partition is created on the replacement node
  4. The new partition is added to the MD array and rebuilds
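
The decision that starts this sequence is a simple timeout check. The sketch below assumes the controller records when a member was first seen missing; `should_replace` and its arguments are hypothetical names, and only the 15-minute default comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the documentation; configurable via memberMissingTimeout.
DEFAULT_MEMBER_MISSING_TIMEOUT = timedelta(minutes=15)


def should_replace(missing_since: datetime, now: datetime,
                   timeout: timedelta = DEFAULT_MEMBER_MISSING_TIMEOUT) -> bool:
    """Return True once a member has been missing for longer than the timeout."""
    return now - missing_since > timeout


if __name__ == "__main__":
    went_missing = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(should_replace(went_missing, went_missing + timedelta(minutes=10)))
    print(should_replace(went_missing, went_missing + timedelta(minutes=20)))
```

Until the check returns true, the member is left alone so that a node that is merely rebooting can rejoin and rebuild instead of being replaced.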

Volume Relocation

When the consumer node (the node hosting the MD device) is drained with kubectl drain --ignore-daemonsets, MeshStor automatically migrates the volume to the node where the pod is rescheduled. The old partition is imported there via NVMe-oF, a new local partition is created, and MD syncs the data. Once the sync completes, the old partition is removed. Draining a provider-only node has no immediate effect: with --ignore-daemonsets the DaemonSet pods are not evicted, so the node continues exporting its partitions normally.

See Volume Relocation for detailed scenarios, observability commands, and troubleshooting.

What's Next