Volume Relocation

MeshStor can move volume data between nodes without downtime. Relocation happens automatically during node drain and member replacement. This page covers all relocation scenarios, how to observe them, and how to troubleshoot common issues.

How Relocation Is Possible

Even with replicaCount=1, MeshStor creates a 2-slot RAID1 array: one slot holds the active partition, the other is a placeholder ("missing"). This placeholder slot allows data migration — a new partition fills the empty slot, syncs from the existing member, and then the original member is removed.

With replicaCount>=2, the array already has multiple active members. A new partition is added as a spare, and mdadm --replace gracefully swaps it in without any degraded window. The new partition briefly appears with state Spare in .status.partitions[] before the kernel promotes it into Syncing.
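The two paths can be sketched as the mdadm commands an operator would run by hand on a plain MD RAID1 array. This is a dry-run illustration only: the function name print_relocation_plan and the device paths are hypothetical, and the exact commands MeshStor issues internally may differ. The helper prints commands and never touches a device.

```shell
# Hypothetical dry-run helper: prints the mdadm steps for moving an array
# member from OLD to NEW. It only echoes text and runs nothing.
print_relocation_plan() {
    replica_count=$1 md=$2 old=$3 new=$4
    if [ "$replica_count" -eq 1 ]; then
        # The array is [old, missing]; the new partition fills the empty slot.
        echo "mdadm $md --add $new"
        echo "mdadm --wait $md"
        echo "mdadm $md --fail $old --remove $old"
    else
        # Every slot is active, so the new partition joins as a spare and
        # --replace copies onto it before the old member is dropped.
        echo "mdadm $md --add $new"
        echo "mdadm $md --replace $old --with $new"
        echo "mdadm --wait $md"
        echo "mdadm $md --remove $old"
    fi
}
```

For example, print_relocation_plan 1 /dev/md0 /dev/nvme1n1p1 /dev/nvme0n1p1 prints the add, wait, and fail+remove sequence for the single-replica case.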

Consumer Node Drain

When you drain the node that hosts the MD RAID device (the consumer node), MeshStor migrates the volume to whichever node Kubernetes reschedules the pod to.

Migration is ultimately driven by Kubernetes rescheduling the pod — MeshStor reacts to the resulting NodeUnstageVolume and NodeStageVolume calls. An involuntarily unhealthy consumer (NotReady, lost from the cluster) triggers the same flow only if Kubernetes actually evicts and reschedules the pod, which in turn depends on the pod's controller (Deployment/StatefulSet), its tolerations for the node.kubernetes.io/not-ready taint, and cluster policy. An unmanaged pod stays stuck; kubectl drain --ignore-daemonsets is the reliable way to move the volume on demand.

What Happens

sequenceDiagram
    participant K as Kubernetes
    participant Old as Old Node
    participant New as New Node
    participant R as Reconciler (Old)

    K->>Old: kubectl drain --ignore-daemonsets (evicts pod)
    Old->>Old: NodeUnstageVolume: clear NodeName, stop MD
    K->>New: Schedule pod on new node
    New->>New: NodeStageVolume (attempt 1): no remote partitions yet, set NodeName=New, fail
    R->>R: Sees NodeName=New, exports partition via NVMe-oF
    New->>New: NodeStageVolume (retry): connect remote, assemble MD, create local partition, add to MD
    New->>New: MD syncs data from remote to local
    New->>New: Reconciler: detect excess member, remove old partition

  1. kubectl drain --ignore-daemonsets evicts the pod from the consumer node.
  2. NodeUnstageVolume on the old node clears NodeName in the CR and stops the MD device. The partition remains on disk.
  3. The first NodeStageVolume attempt on the new node sets NodeName to the new node and fails, because no remote partition is exported yet. The reconciler on the old node sees the new NodeName and exports its partition via NVMe-oF. The retried NodeStageVolume then connects to the old partition, assembles the MD device from it, creates a new local partition, and adds it to the array. MD begins syncing data.
  4. The reconciler on the new node detects more members than expected, waits for all to be in sync, then removes the old remote partition.
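Before draining, it can help to know which volumes the node currently hosts as consumer. The following is a minimal sketch that assumes the column layout of kubectl get msvol shown under Observing Relocation, and that the NODE column names the consumer node. The function name volumes_on_node is hypothetical.

```shell
# Reads `kubectl get msvol --no-headers` output on stdin and prints the
# names of volumes whose NODE column matches the given node.
# NODE is taken as the second-to-last field because the SYNC column can be
# empty, which changes the number of fields per row.
volumes_on_node() {
    awk -v node="$1" 'NF >= 2 && $(NF - 1) == node { print $1 }'
}
```

Usage: kubectl get msvol --no-headers | volumes_on_node mf-01-02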

replicaCount=1 vs replicaCount>=2

| Aspect | replicaCount=1 | replicaCount>=2 |
|---|---|---|
| MD state during migration | 2 active members (was [local, missing]) | replicaCount+1 members (local added as spare) |
| Removal method | mdadm --fail + mdadm --remove (instant) | mdadm --replace with spare (graceful) |
| Degraded window | Brief moment during fail+remove | None: spare syncs before old member is removed |
| Final state | [local, missing], same as before drain | replicaCount active members |

replicaCount=1 State Diagram

flowchart LR
    subgraph "Before Drain"
        A1["Node A: [local, missing]<br/>Pod running"]
    end

    subgraph "During Migration"
        B1["Node B: [remote(A), local(B)]<br/>MD syncing"]
    end

    subgraph "After Cleanup"
        C1["Node B: [local, missing]<br/>Pod running<br/>Node A partition removed"]
    end

    A1 -->|"kubectl drain --ignore-daemonsets A"| B1
    B1 -->|"sync complete"| C1

Note

The new node is selected by Kubernetes pod scheduling, not by MeshStor. MeshStor creates a local partition on whichever node the pod lands on.

Relocation Reserve

To make sure a node drain has somewhere to land, the controller reserves a configurable percentage of free space on every device when scoring placement candidates. A device whose largest usable free space would drop below that reserve once the new partition is placed is excluded from the candidate set. The reserve is enabled by default. Without it, a cluster running close to capacity would accept new volumes right up to the limit, leaving drains with nowhere to relocate displaced replicas.

| Setting | Default | Where to configure |
|---|---|---|
| Cluster-wide reserve | 10 % | Helm value defaultRelocationReservePercent (or the --default-relocation-reserve-percent CLI flag on the node binary) |
| Per-node override | inherits cluster default | Node label meshstor.io/relocation-reserve-percent=<n> (set to 0 to disable the gate on a single node) |

Set the cluster-wide reserve to 0 to disable the gate entirely, which is useful for single-node test clusters where no drain headroom is needed.
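The gate can be illustrated with a little arithmetic. This sketch is not MeshStor's implementation: it assumes the reserve is expressed as a percentage of total device capacity and that the check uses the largest free extent remaining after placement. The function name device_passes_reserve and its parameters are hypothetical.

```shell
# Sketch of the reserve gate under stated assumptions (not MeshStor's code):
#   capacity      total device size
#   largest_free  largest usable free extent before placement
#   request       size of the partition to place
#   reserve_pct   reserve as a percentage of device capacity (assumption)
# All sizes share one unit, for example GiB. Returns 0 if the device stays
# a candidate, 1 if the gate excludes it.
device_passes_reserve() {
    capacity=$1 largest_free=$2 request=$3 reserve_pct=$4
    remaining=$((largest_free - request))
    # Compare remaining/capacity with reserve_pct/100 without dividing.
    [ $((remaining * 100)) -ge $((capacity * reserve_pct)) ]
}
```

With a 1000 GiB device, 300 GiB free, and a 10 % reserve, a 150 GiB request leaves 150 GiB and passes, while a 250 GiB request leaves 50 GiB and is rejected. With the reserve set to 0, the 250 GiB request passes.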

Member Replacement

When a partition is permanently lost — node offline, drive failure, or persistent network partition — MeshStor automatically replaces it after a configurable timeout. See Self-Healing: Automatic Replacement for the full flow, state transitions, and tuning options.

Provider Node Drain

Draining a node that only provides remote partitions (not the MD device host) has no immediate effect on volumes:

  • kubectl drain --ignore-daemonsets keeps the MeshStor DaemonSet pod running.
  • The DaemonSet continues exporting partitions via NVMe-oF.
  • The consumer node maintains its connections normally.

If the provider node is later shut down or decommissioned, the exported partitions become unreachable. The self-healing flow handles this once memberMissingTimeout expires.

Observing Relocation

Watch Volume Phases

kubectl get msvol -w

During a drain migration, you will see:

NAME               PHASE       MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced      active       2/2     0                  mf-01-02   1h
pvc-cd1038a7-...   Syncing     recovering   2/3     1          12.5%   mf-01-02   1h
pvc-cd1038a7-...   Replacing   recovering   2/3     1          67.3%   mf-01-02   1h
pvc-cd1038a7-...   Synced      active       2/2     0                  mf-01-02   1h
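For scripting, the watch stream can be filtered so that a script stops once a volume returns to Synced. This is a minimal sketch that assumes the column layout shown above (NAME first, PHASE second). The function name first_resynced_volume is hypothetical.

```shell
# Reads `kubectl get msvol -w` output on stdin. Prints the name of the first
# volume that returns to Synced after a non-Synced phase, then stops.
# Exits non-zero if the stream ends without such a transition.
first_resynced_volume() {
    awk '
        NF < 2 || $1 == "NAME" { next }          # skip blank and header rows
        $2 != "Synced" { busy[$1] = 1; next }    # volume is in a transitional phase
        busy[$1] { found = 1; print $1; exit }   # same volume is Synced again
        END { exit !found }
    '
}
```

Usage: kubectl get msvol -w | first_resynced_volume. When the filter exits, kubectl stops at its next write to the closed pipe.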

Inspect Partition Details

kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\n"}{end}'

During migration (both old and new partitions visible):

node-a    Synced
node-b    Created

After cleanup:

node-b    Synced
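The jsonpath output above is easy to check in a script. A minimal sketch, assuming one "nodeID state" row per partition as shown. The function name all_partitions_synced is hypothetical.

```shell
# Reads "<nodeID> <state>" rows on stdin (the jsonpath output above).
# Returns 0 only if there is at least one row and every state is Synced.
all_partitions_synced() {
    awk 'NF >= 2 { rows++; if ($2 != "Synced") bad = 1 }
         END { exit (rows == 0 || bad) }'
}
```

Note that this reports only partition states. Both the old and the new partition can be Synced for a short time before the old one is removed, so a single Synced row is the clearer sign that cleanup has finished.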

Check Sync Progress

kubectl get msvol my-volume

NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Syncing   recovering   1/2     1          45.2%   mf-01-03   1h

Troubleshooting

Volume Stuck in Replacing

The old partition is not being removed after drain migration.

  • Check sync progress: the old member is only removed after the new member finishes syncing. Watch kubectl get msvol -w for sync progress.
  • Check NVMe-oF connectivity: verify the consumer node can reach the node exporting the remote partition on the NVMe-oF port (TCP 4420 or RDMA 4421).
  • Check the reconciler is running: verify the MeshStor DaemonSet pod is healthy on the consumer node: kubectl get pods -n meshstor -o wide.
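For the connectivity check, a plain TCP probe run from the consumer node is often enough to rule out firewall problems. This sketch relies on bash's /dev/tcp redirection and the coreutils timeout command. It proves TCP reachability only and cannot validate an RDMA transport. The function name tcp_port_open is hypothetical.

```shell
# Returns 0 if a TCP connection to HOST:PORT opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed even if the
# calling shell is a different one.
tcp_port_open() {
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Usage: tcp_port_open <peer-node-ip> 4420 && echo reachable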

No Replacement Node Available

All nodes already host a partition for this volume, or no node has sufficient free space.

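  • Check available capacity: run kubectl get msnd and look for nodes with free space. Devices whose free space would fall below the relocation reserve are excluded from placement as well (see Relocation Reserve).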
  • Add capacity: add a new node with NVMe drives, or free space on existing nodes by deleting unused volumes. Run the bundled meshstor-cleanup helper on the affected node — see Disk Cleanup.

Partition Stuck in Missing

The partition is marked Missing but replacement has not started.

  • Check the timeout: kubectl get msvol my-volume -o jsonpath='{.spec.memberMissingTimeout}' — replacement waits for this timeout to expire. Tune memberMissingTimeout if needed.
  • Check the timestamp: kubectl get msvol my-volume -o yaml — inspect .status.partitions[].updatedAt to see when the partition was marked Missing.
  • Check the consumer DaemonSet: the reconciler runs on the consumer node. If that pod is unhealthy, replacement will not trigger.

What's Next

  • Self-Healing — automatic recovery from node and network failures
  • Volume Expansion — grow a volume while the pod is running
  • Monitoring — observe relocation progress and health
  • Replication — how MD RAID replication enables non-disruptive moves