Common Issues¶
Symptom-driven troubleshooting for the most frequently encountered problems.
PVC Stuck in Pending¶
Symptom: PVC stays Pending and never binds to a PV.
Possible causes:
| Cause | Diagnosis | Fix |
|---|---|---|
| No free space on any node | `kubectl get msnd` — check the FREE column | Add drives or delete unused volumes |
| No nodes with matching NVMe drives | `kubectl get msnd` — no resources listed | Verify NVMe drives exist and node pods are running |
| StorageClass misconfigured | `kubectl describe pvc <name>` — check events | Verify the provisioner is `io.meshstor.csi.mesh` |
| Controller not running | `kubectl -n meshstor get pods` — controller pod missing or crash-looping | Check controller logs: `kubectl -n meshstor logs statefulset/meshstor-csi-controller -c csi-plugin` |
Volume Stuck in Requested Phase¶
Symptom: kubectl get msvol shows phase Requested for more than a few minutes.
Cause: Remote nodes cannot create the requested partitions. The controller selected nodes, but those nodes haven't acted on the request.
Diagnosis:
```shell
# Check which nodes were selected for partitions
kubectl get msvol <name> -o jsonpath='{range .status.partitions[*]}{.nodeID} {.state}{"\n"}{end}'
```
Fix:
- Verify the target node pods are running:
  ```shell
  kubectl -n meshstor get pods -o wide | grep <node-name>
  ```
- Check node pod logs for errors:
  ```shell
  kubectl -n meshstor logs <node-pod> -c csi-plugin | tail -50
  ```
- Verify NVMe-oF kernel modules are loaded on the target node:
  ```shell
  ls /sys/kernel/config/nvmet/
  ```
- Verify the target node has free space:
  ```shell
  kubectl get msnd | grep <node-name>
  ```
Volume Stuck in Syncing Phase¶
Symptom: kubectl get msvol shows phase Syncing with a syncPercentage that isn't progressing.
Cause: MD RAID rebuild is stalled, often due to a disconnected or slow member.
Diagnosis:
```shell
# Watch sync progress
kubectl get msvol <name> -w

# Check partition states — look for Missing or Faulty
kubectl get msvol <name> -o jsonpath='{range .status.partitions[*]}{.nodeID} {.state}{"\n"}{end}'
```
Fix:
- If a partition is Missing: the remote node may be down or the NVMe-oF connection may have dropped. Check node status and network connectivity.
- If the sync percentage is advancing slowly: this is normal for large volumes. MD rebuild speed depends on I/O load and network throughput.
- If the sync percentage is stuck at 0%: check whether the remote partition is actually connected. Look for NVMe-oF errors in the node pod logs.
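The syncPercentage reported in the volume status reflects the underlying MD rebuild, so with shell access to the node you can cross-check it against `/proc/mdstat`. A minimal sketch of extracting the percentage — the sample line below stands in for real `/proc/mdstat` output, and which md device MeshStor uses for a given volume is not covered here:

```shell
# Extract the rebuild percentage from /proc/mdstat-style output.
# On a real node, replace the sample with: cat /proc/mdstat
sample='      [==>..................]  recovery = 12.6% (1310720/10485760) finish=5.2min speed=29360K/sec'
pct=$(printf '%s\n' "$sample" | grep -o '[0-9][0-9.]*%' | head -n 1)
echo "$pct"
```

If the percentage in `/proc/mdstat` is moving but the volume status is not, the problem is in status reporting rather than the rebuild itself.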
Mount Fails with Unavailable¶
Symptom: Pod cannot start, events show FailedMount with gRPC code Unavailable.
Cause: No partitions are reachable — neither local nor remote.
Diagnosis:
```shell
# Check volume partition states
kubectl get msvol <name> -o yaml | grep -A5 partitions

# Check node connectivity
kubectl -n meshstor logs <node-pod> -c csi-plugin | grep -i "nvme\|connect\|import"
```
Fix:
- Verify the NVMe-oF node annotations are set:
  ```shell
  kubectl get node <name> -o jsonpath='{.metadata.annotations}'
  ```
- Verify ports 4420/4421 are open between nodes.
- Verify the remote node's NVMe-oF target is running:
  ```shell
  ls /sys/kernel/config/nvmet/subsystems/
  ```
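To confirm that the ports above are actually reachable from a peer node, a quick probe with `nc` works. This is a sketch: `<remote-node-ip>` is a placeholder for the peer's address, in the same style as the other placeholders on this page.

```shell
# Probe the NVMe-oF target ports (4420/4421) on a remote MeshStor node.
for port in 4420 4421; do
  if nc -z -w 2 <remote-node-ip> "$port"; then
    echo "port $port reachable"
  else
    echo "port $port unreachable"
  fi
done
```

An unreachable port usually points at a firewall or network-policy rule between the nodes rather than at MeshStor itself.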
Volume Expansion Fails¶
Symptom: A PVC resize request is rejected, or the volume stays in the Expanding phase.
Possible causes:
| Cause | Diagnosis | Fix |
|---|---|---|
drivesPerCopy > 1 |
Check StorageClass parameters | Expansion is only supported when drivesPerCopy=1. This is a design limitation. |
| Not enough free space after partition | kubectl get msnd — check FREE on the partition's node |
Free space on the same drive must be contiguous and large enough for the growth |
| Volume not yet synced | kubectl get msvol <name> — phase is not Synced |
Wait for the volume to finish syncing before expanding |
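Expansion eligibility is largely fixed at StorageClass creation time. A sketch of a class that permits resize under the constraint above — the provisioner string and `drivesPerCopy` parameter come from this page, `allowVolumeExpansion` is the standard Kubernetes field, and any other parameters your deployment needs are omitted:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-expandable   # illustrative name
provisioner: io.meshstor.csi.mesh
allowVolumeExpansion: true    # required for any PVC resize to be accepted
parameters:
  drivesPerCopy: "1"          # expansion is only supported at 1
```

Note that `allowVolumeExpansion` can be toggled on an existing StorageClass, but `parameters` cannot be changed after creation.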
Pod Evicted from Node¶
Symptom: Pod is unexpectedly evicted with a MeshStor-related event.
Cause: MeshStor evicts pods to enforce single-node-writer semantics. This happens when:
- A volume is mounted on a node, but the volume's owner node changes (e.g., during node failure and recovery)
- An unmanaged pod (no controller like Deployment/StatefulSet) is using a volume that needs to be unmounted
Fix:
- For managed pods (Deployments, StatefulSets): the pod will be rescheduled automatically
- For unmanaged pods: recreate the pod after the volume has been safely unmounted
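Whether a pod falls in the managed or unmanaged case can be read from its `metadata.ownerReferences`: a controller-managed pod carries one, a bare pod does not. A minimal check — the sample JSON below stands in for the output of `kubectl get pod <name> -o json`:

```shell
# A pod created by a controller (StatefulSet, ReplicaSet, ...) carries an
# ownerReferences entry; a bare pod does not, and will not come back
# on its own after eviction.
pod_json='{"metadata":{"name":"web-0","ownerReferences":[{"kind":"StatefulSet","name":"web"}]}}'
if printf '%s' "$pod_json" | grep -q '"ownerReferences"'; then
  echo "managed"
else
  echo "unmanaged"
fi
```

On a live cluster, `kubectl get pod <name> -o jsonpath='{.metadata.ownerReferences[*].kind}'` prints the controller kind directly; empty output means the pod is unmanaged.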
Gathering Diagnostics¶
When filing a bug report, collect:
```shell
# Cluster state
kubectl get nodes -o wide
kubectl get -A pods,pvc,pv,msvol,msnd

# Controller logs
kubectl -n meshstor logs statefulset/meshstor-csi-controller -c csi-plugin --tail=200

# Node logs (for the affected node)
kubectl -n meshstor logs <node-pod> -c csi-plugin --tail=200

# Volume detail
kubectl get msvol <volume-name> -o yaml

# Events
kubectl get events --sort-by='.lastTimestamp' | grep -i meshstor
```
What's Next¶
- Self-Healing — automatic recovery from failures
- Volume Relocation — data migration during node drain
- Monitoring — proactive health tracking
- Internals — understand component interactions for deeper debugging