Volume Expansion
MeshStor supports online volume expansion — you can grow a PVC while the pod is running, with no downtime. The driver grows each partition on every node, then expands the MD RAID array and XFS filesystem in place.
How It Works
- You patch the PVC with a larger storage request.
- The controller validates the request, updates spec.capacityBytes, and sets the volume phase to Expanding.
- The consumer node (where the PVC is mounted) grows its local partition.
- Each remote node (where a replica lives) grows its partition independently via the reconciler.
- Once all partitions are at the new size, the consumer node runs mdadm --grow and xfs_growfs to expand the MD array and filesystem.
- The volume phase returns to Synced.
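On the consumer node, the final step corresponds roughly to the manual commands below. This is a minimal sketch rather than the driver's actual code: the helper name, the device /dev/md0, and the mount path /data are hypothetical, and using --size=max is an assumption about how the array is told to use the new space.

```shell
# Sketch only: grow_array_and_fs, /dev/md0, and /data are illustrative names.
# Run this only after every member partition is already at the new size.
grow_array_and_fs() {
  md_device="$1"    # MD array backing the volume, e.g. /dev/md0
  mount_path="$2"   # where the XFS filesystem is mounted, e.g. /data

  # Let the array use all the space now available on its members.
  mdadm --grow "$md_device" --size=max || return 1

  # Grow XFS to fill the enlarged array; xfs_growfs operates on a mounted filesystem.
  xfs_growfs "$mount_path"
}

# Usage (as root, on the consumer node):
#   grow_array_and_fs /dev/md0 /data
```

The array is grown before the filesystem because XFS can only extend into space the block device already exposes.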
Sequence diagram
sequenceDiagram
participant User
participant API as Kubernetes API
participant csi-resizer
participant Controller
participant Remote as Remote Node
participant kubelet
participant Consumer as Consumer Node
User->>API: Patch PVC with larger size
API-->>csi-resizer: csi-resizer watches API changes
csi-resizer->>Controller: ControllerExpandVolume
Controller->>API: Update CR capacityBytes, set phase=Expanding
API-->>Remote: reconciler watches API changes
Remote->>Remote: Reconciler: grow replica partition
Remote->>API: Update partition SizeBytes in CR
API-->>kubelet: kubelet watches API changes
kubelet->>Consumer: NodeExpandVolume
Consumer->>Consumer: Grow local partition in-place
Consumer->>Consumer: Verify all partitions expanded
Consumer->>Consumer: mdadm --grow (expand MD array)
Consumer->>Consumer: xfs_growfs (expand filesystem)
Consumer->>API: Set phase=Synced
In-Place Growth vs Replacement
MeshStor first tries to grow each partition in place by extending it into adjacent free space on the same drive. This is fast and requires no data sync.
If a drive lacks adjacent free space, MeshStor creates a replacement partition on another drive on the same node, swaps it into the MD array via mdadm --replace, and removes the old partition. This path is slower because the replacement must sync all data from the other members.
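The choice between the two paths can be sketched as a small decision function. Everything here is illustrative: the function name and its arguments are hypothetical, and it assumes a replacement partition needs room for the full new size on the other drive, which the text above does not spell out.

```shell
# Sketch only: choose_growth_path is not part of MeshStor; all sizes are in bytes.
choose_growth_path() {
  current_size="$1"      # current partition size
  new_size="$2"          # requested partition size
  adjacent_free="$3"     # free space directly after the partition on its drive
  other_drive_free="$4"  # largest free extent on another drive of the same node

  delta=$((new_size - current_size))
  if [ "$delta" -le "$adjacent_free" ]; then
    echo "in-place"            # extend the partition, no data sync needed
  elif [ "$new_size" -le "$other_drive_free" ]; then
    echo "replacement"         # new partition elsewhere, then a full sync
  else
    echo "insufficient-space"  # neither path is possible on this node
  fi
}

# Usage:
#   choose_growth_path 10737418240 21474836480 0 32212254720
```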
Expansion-replacement shares the state machine that drives ordinary faulty-member replacement. While it runs, the volume's .status.partitions[] can temporarily hold two entries for the same node — the original (still-small) partition and the new grown one — disambiguated by their nvmeofNamespaceID. The old entry is only removed once the new one has finished syncing, so the volume keeps full redundancy throughout the replacement.
Issuing a further resize while an expansion is still in flight is accepted: the driver compares the new target against the current per-partition sizes and converges toward it. Each xfs_growfs is gated on the MD array having all members at the new size, so the filesystem never races ahead of the underlying partitions.
Prerequisites
- StorageClass must have allowVolumeExpansion: true. All default examples except RAID10 have this enabled.
- stripeWidth must be 1. Expansion is not supported for striped volumes (stripeWidth > 1).
- Sufficient free space on the drives hosting partitions. Each partition must be able to grow by the requested delta, either in place or on another drive of the same node via replacement.
Expanding a Volume
Edit the PVC to request a larger size:
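A minimal sketch of such a patch, wrapped in a helper so that the PVC name and size are the only inputs. The helper name and the PVC name my-pvc are hypothetical; the patched field, spec.resources.requests.storage, is the standard PVC size field.

```shell
# Sketch only: expand_pvc and "my-pvc" are illustrative names.
expand_pvc() {
  pvc_name="$1"   # e.g. my-pvc
  new_size="$2"   # Kubernetes quantity, e.g. 20Gi

  # Merge patch that changes only spec.resources.requests.storage.
  patch=$(printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$new_size")
  kubectl patch pvc "$pvc_name" --type merge -p "$patch"
}

# Usage:
#   expand_pvc my-pvc 20Gi
```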
Or run kubectl edit pvc <pvc-name> and raise spec.resources.requests.storage in the editor.
Note
The actual allocated size may be slightly larger than requested due to 1 MiB alignment.
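To estimate the allocated size for a given request, the arithmetic can be sketched as follows. It assumes the driver rounds up to the next multiple of 1 MiB; the exact rounding rule is not stated here, and the function name is hypothetical.

```shell
# Sketch only: assumes sizes are rounded UP to the next 1 MiB boundary.
MIB=1048576

align_up_mib() {
  bytes="$1"
  # Integer arithmetic: add (MIB - 1), then truncate to a whole number of MiB.
  echo $(( (bytes + MIB - 1) / MIB * MIB ))
}

# Usage:
#   align_up_mib 10000000000   # prints 10000269312
#   align_up_mib 10737418240   # 10Gi is already aligned, prints 10737418240
```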
Observing Expansion
Watch Volume Phases
Typical output during expansion:
NAME              PHASE      MDSTATE  READY  DEGRADED  SYNC  NODE      AGE
pvc-cd1038a7-...  Expanding  active   2/2    0               mf-01-02  1h
pvc-cd1038a7-...  Synced     active   2/2    0               mf-01-02  1h
If replacement partitions are needed (no adjacent free space), you will see a sync phase:
NAME              PHASE      MDSTATE     READY  DEGRADED  SYNC   NODE      AGE
pvc-cd1038a7-...  Expanding  active      2/2    0                mf-01-02  1h
pvc-cd1038a7-...  Expanding  recovering  1/2    1         12.5%  mf-01-02  1h
pvc-cd1038a7-...  Expanding  recovering  1/2    1         67.3%  mf-01-02  1h
pvc-cd1038a7-...  Expanding  active      2/2    0                mf-01-02  1h
pvc-cd1038a7-...  Synced     active      2/2    0                mf-01-02  1h
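For scripts that need to block until the expansion finishes, the phase can be polled instead of watched. The sketch below assumes PHASE is the second column of the kubectl get msvol output, as in the listings above; the helper names, poll count, and interval are hypothetical.

```shell
# Sketch only: volume_phase and wait_for_synced are illustrative helpers.
volume_phase() {
  # PHASE is assumed to be the second whitespace-separated column.
  kubectl get msvol "$1" --no-headers | awk '{print $2}'
}

wait_for_synced() {
  volume="$1"
  attempts="${2:-60}"   # number of polls before giving up
  interval="${3:-5}"    # seconds between polls

  while [ "$attempts" -gt 0 ]; do
    [ "$(volume_phase "$volume")" = "Synced" ] && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1   # still not Synced after all attempts
}

# Usage:
#   wait_for_synced my-volume && echo "expansion finished"
```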
Check Partition Sizes
Verify that all partitions have been expanded:
kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.sizeBytes}{"\t"}{.state}{"\n"}{end}'
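The tab-separated output of that query can be filtered to list only the partitions still below the target. This is a sketch under the assumption that the target size in bytes is known to the caller; the helper name is hypothetical.

```shell
# Sketch only: lagging_partitions is an illustrative helper.
# Reads "nodeID<TAB>sizeBytes<TAB>state" lines on stdin and prints the
# node ID and size of every partition smaller than the target.
lagging_partitions() {
  target_bytes="$1"
  awk -F '\t' -v target="$target_bytes" \
    '$2 + 0 < target + 0 { print $1 "\t" $2 }'
}

# Usage:
#   kubectl get msvol my-volume -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.sizeBytes}{"\t"}{.state}{"\n"}{end}' \
#     | lagging_partitions 21474836480
```

Empty output means every partition has reached the target size.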
Verify Filesystem Size
From inside the pod:
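A minimal check, assuming the volume is mounted at the hypothetical path /data. The helper prints the filesystem size in bytes so it can be compared directly with the volume's capacity; its name is illustrative.

```shell
# Sketch only: fs_size_bytes is an illustrative helper; /data is a hypothetical mount path.
fs_size_bytes() {
  # df -P gives stable POSIX output and -k reports 1024-byte blocks;
  # row 2, column 2 is the total size of the filesystem.
  size_kib=$(df -Pk "$1" | awk 'NR==2 {print $2}')
  echo $((size_kib * 1024))
}

# Usage inside the pod:
#   df -h /data           # human-readable view
#   fs_size_bytes /data   # exact size in bytes
```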
The filesystem size should reflect the new capacity (minus a small overhead for XFS metadata).
Constraints and Limitations
| Constraint | Detail |
|---|---|
| stripeWidth > 1 | Expansion is not supported for striped (RAID10) volumes. Set allowVolumeExpansion: false in the StorageClass. |
| Shrinking | Volume shrinking is not supported. Requests for a smaller size are rejected. |
| Consecutive expansion | A second resize issued before the first has reached Synced is accepted; the driver converges partitions toward the latest requested size. |
| Replacement sync time | If in-place growth fails on any partition, the replacement path requires a full data sync. Duration depends on volume size and network throughput. |
Troubleshooting
Expansion Stuck in Expanding Phase
The volume stays in Expanding and never transitions to Synced.
Check partition sizes in the CR with the jsonpath query from Check Partition Sizes, and look at status.partitions[*].sizeBytes. If any partition still shows the old size, that node has not completed its expansion.
Check the node logs for the lagging partition:
# Find which node has the undersized partition, then check its logs:
kubectl logs -n meshstor daemonset/meshstor-csi-node --selector=... | grep -i expand
Common causes:
- No free space on the drive. The partition cannot grow in place and no other drive has enough room for a replacement. Check available capacity with kubectl get msnd.
- Node is down. The remote node's reconciler cannot run. Verify the DaemonSet pod is healthy on that node.
- NVMe-oF connectivity issue. The consumer node cannot reach the remote node to import the replacement partition. Check network connectivity on TCP port 4420 (or RDMA port 4421).
Expansion Rejected
The kubectl patch command returns an error.
- allowVolumeExpansion not set: The StorageClass must have allowVolumeExpansion: true.
- stripeWidth > 1: Expansion is only supported for stripeWidth=1. RAID10 expansion is on the roadmap.
- Size not larger: The requested size must be strictly larger than the current size.
Filesystem Size Does Not Match
After expansion completes (phase=Synced), df inside the pod shows a smaller size than expected.
XFS metadata overhead accounts for a small percentage of the raw capacity. A difference of under 10% is normal.
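To put a number on the gap, the overhead can be computed from the volume's raw capacity and the size df reports. A minimal sketch with a hypothetical helper name; it uses integer arithmetic, so the result is truncated to a whole percent.

```shell
# Sketch only: overhead_pct is an illustrative helper; both arguments are in bytes.
overhead_pct() {
  raw_bytes="$1"   # volume capacity, e.g. spec.capacityBytes
  fs_bytes="$2"    # filesystem size as reported by df inside the pod
  echo $(( (raw_bytes - fs_bytes) * 100 / raw_bytes ))
}

# Usage:
#   overhead_pct 1000 950   # prints 5
```

A result of 10 or more suggests the filesystem was not grown to the full array size and is worth investigating.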
What's Next
- Replication — how replicas and striping affect fault tolerance
- Volume Relocation — how volumes migrate during node drain
- StorageClass Parameters — complete parameter reference
- Monitoring — observability and alerting for MeshStor volumes