Common Issues¶
Symptom-driven troubleshooting for the most frequently encountered problems.
PVC Stuck in Pending¶
Symptom: PVC stays Pending and never binds to a PV.
Possible causes:
| Cause | Diagnosis | Fix |
|---|---|---|
| No free space on any node | `kubectl get msnd` — check the FREE column | Add drives or delete unused volumes |
| No nodes with matching NVMe drives | `kubectl get msnd` — no resources listed | Verify NVMe drives exist and node pods are running |
| StorageClass misconfigured | `kubectl describe pvc <name>` — check events | Verify the provisioner is `io.meshstor.csi.mesh` |
| Controller not running | `kubectl -n meshstor get pods` — controller pod missing or crash-looping | Check controller logs: `kubectl -n meshstor logs statefulset/meshstor-csi-controller -c csi-plugin` |
Volume Stuck in Requested Phase¶
Symptom: kubectl get msvol shows phase Requested for more than a few minutes.
Cause: Remote nodes cannot create the requested partitions. The controller selected nodes, but those nodes haven't acted on the request.
Diagnosis:
```shell
# Check which nodes were selected for partitions
kubectl get msvol <name> -o jsonpath='{range .status.partitions[*]}{.nodeID} {.state}{"\n"}{end}'
```
Fix:
- Verify the target node pods are running:
  ```shell
  kubectl -n meshstor get pods -o wide | grep <node-name>
  ```
- Check node pod logs for errors:
  ```shell
  kubectl -n meshstor logs <node-pod> -c csi-plugin | tail -50
  ```
- Verify NVMe-oF kernel modules are loaded on the target node:
  ```shell
  ls /sys/kernel/config/nvmet/
  ```
- Verify the target node has free space:
  ```shell
  kubectl get msnd | grep <node-name>
  ```
Volume Stuck in Syncing Phase¶
Symptom: kubectl get msvol shows phase Syncing with a syncPercentage that isn't progressing.
Cause: MD RAID rebuild is stalled, often due to a disconnected or slow member.
Diagnosis:
```shell
# Watch sync progress
kubectl get msvol <name> -w

# Check partition states — look for Missing or Faulty
kubectl get msvol <name> -o jsonpath='{range .status.partitions[*]}{.nodeID} {.state}{"\n"}{end}'
```
Fix:
- If a partition is Missing: the remote node may be down or the NVMe-oF connection may have dropped. Check node status and network connectivity.
- If the sync percentage is advancing slowly: this is normal for large volumes. MD rebuild speed depends on I/O load and network throughput.
- If the sync percentage is stuck at 0%: check whether the remote partition is actually connected. Look for NVMe-oF errors in the node pod logs.
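The syncPercentage reported in the volume status reflects the underlying MD rebuild, so with shell access to the node you can cross-check it against `/proc/mdstat`. A minimal sketch of extracting the percentage — the sample line below stands in for real `/proc/mdstat` output, and which md device MeshStor uses for a given volume is not covered here:

```shell
# Extract the rebuild percentage from /proc/mdstat-style output.
# On a real node, replace the sample with: cat /proc/mdstat
sample='      [==>..................]  recovery = 12.6% (1310720/10485760) finish=5.2min speed=29360K/sec'
pct=$(printf '%s\n' "$sample" | grep -o '[0-9][0-9.]*%' | head -n 1)
echo "$pct"
```

If the percentage in `/proc/mdstat` is moving but the volume status is not, the problem is in status reporting rather than the rebuild itself.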
Mount Fails with Unavailable¶
Symptom: Pod cannot start, events show FailedMount with gRPC code Unavailable.
Cause: No partitions are reachable — neither local nor remote.
Diagnosis:
```shell
# Check volume partition states
kubectl get msvol <name> -o yaml | grep -A5 partitions

# Check node connectivity
kubectl -n meshstor logs <node-pod> -c csi-plugin | grep -i "nvme\|connect\|import"
```
Fix:
- Verify the NVMe-oF node annotations are set:
  ```shell
  kubectl get node <name> -o jsonpath='{.metadata.annotations}'
  ```
- Verify ports 4420/4421 are open between nodes.
- Verify the remote node's NVMe-oF target is running:
  ```shell
  ls /sys/kernel/config/nvmet/subsystems/
  ```
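To confirm that the ports above are actually reachable from a peer node, a quick probe with `nc` works. This is a sketch: `<remote-node-ip>` is a placeholder for the peer's address, in the same style as the other placeholders on this page.

```shell
# Probe the NVMe-oF target ports (4420/4421) on a remote MeshStor node.
for port in 4420 4421; do
  if nc -z -w 2 <remote-node-ip> "$port"; then
    echo "port $port reachable"
  else
    echo "port $port unreachable"
  fi
done
```

An unreachable port usually points at a firewall or network-policy rule between the nodes rather than at MeshStor itself.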
Volume Expansion Fails¶
Symptom: A PVC resize request is rejected, or the volume stays in the Expanding phase.
Possible causes:
| Cause | Diagnosis | Fix |
|---|---|---|
drivesPerCopy > 1 |
Check StorageClass parameters | Expansion is only supported when drivesPerCopy=1. This is a design limitation. |
| Not enough free space after partition | kubectl get msnd — check FREE on the partition's node |
Free space on the same drive must be contiguous and large enough for the growth |
| Volume not yet synced | kubectl get msvol <name> — phase is not Synced |
Wait for the volume to finish syncing before expanding |
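Expansion eligibility is largely fixed at StorageClass creation time. A sketch of a class that permits resize under the constraint above — the provisioner string and `drivesPerCopy` parameter come from this page, `allowVolumeExpansion` is the standard Kubernetes field, and any other parameters your deployment needs are omitted:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: meshstor-expandable   # illustrative name
provisioner: io.meshstor.csi.mesh
allowVolumeExpansion: true    # required for any PVC resize to be accepted
parameters:
  drivesPerCopy: "1"          # expansion is only supported at 1
```

Note that `allowVolumeExpansion` can be toggled on an existing StorageClass, but `parameters` cannot be changed after creation.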
Pod Evicted from Node¶
Symptom: Pod is unexpectedly evicted with a MeshStor-related event.
Cause: MeshStor evicts pods to enforce single-node-writer semantics. This happens when:
- A volume is mounted on a node, but the volume's owner node changes (e.g., during node failure and recovery)
- An unmanaged pod (no controller like Deployment/StatefulSet) is using a volume that needs to be unmounted
Fix:
- For managed pods (Deployments, StatefulSets): the pod will be rescheduled automatically
- For unmanaged pods: recreate the pod after the volume has been safely unmounted
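Whether a pod falls in the managed or unmanaged case can be read from its `metadata.ownerReferences`: a controller-managed pod carries one, a bare pod does not. A minimal check — the sample JSON below stands in for the output of `kubectl get pod <name> -o json`:

```shell
# A pod created by a controller (StatefulSet, ReplicaSet, ...) carries an
# ownerReferences entry; a bare pod does not, and will not come back
# on its own after eviction.
pod_json='{"metadata":{"name":"web-0","ownerReferences":[{"kind":"StatefulSet","name":"web"}]}}'
if printf '%s' "$pod_json" | grep -q '"ownerReferences"'; then
  echo "managed"
else
  echo "unmanaged"
fi
```

On a live cluster, `kubectl get pod <name> -o jsonpath='{.metadata.ownerReferences[*].kind}'` prints the controller kind directly; empty output means the pod is unmanaged.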
Gathering Diagnostics¶
When filing a bug report, collect:
```shell
# Cluster state
kubectl get nodes -o wide
kubectl get -A pods,pvc,pv,msvol,msnd

# Controller logs
kubectl -n meshstor logs statefulset/meshstor-csi-controller -c csi-plugin --tail=200

# Node logs (for the affected node)
kubectl -n meshstor logs <node-pod> -c csi-plugin --tail=200

# Volume detail
kubectl get msvol <volume-name> -o yaml

# Events
kubectl get events --sort-by='.lastTimestamp' | grep -i meshstor
```
What's Next¶
- Self-Healing — automatic recovery from failures
- Volume Relocation — data migration during node drain
- Monitoring — proactive health tracking
- Internals — understand component interactions for deeper debugging