Monitoring
MeshStor exposes volume health through Kubernetes custom resources. This page explains what to watch and how to set up alerts.
Volume Health
The primary health indicator is the MeshStorVolume (msvol) custom resource:
NAME          PHASE     MDSTATE      TOTAL   ACTIVE   FAILED   DOWN   SYNC    AGE
pvc-abc123..  Synced    active       2       2        0        0              1h
pvc-def456..  Syncing   recovering   2       1        0        1      45.2%   30m
pvc-ghi789..  Synced    degraded     3       2        1        0              2h
Key Fields
| Field | Healthy Value | Alert When |
|---|---|---|
| `Phase` | `Synced` | Stuck in `Requested`, `Syncing`, `Expanding`, or `Replacing` for more than 10 minutes |
| `MDState` | `active` or `clean` | `degraded`, `recovering`, or `missing` |
| `ActiveDevices` | Equal to `TotalDevices` | Less than `TotalDevices` |
| `FailedDevices` | 0 | Greater than 0 |
| `DownDevices` | 0 | Greater than 0 |
| `SyncPercentage` | Empty (fully synced) | Present for extended periods (rebuild stalled) |
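The field checks in the table can be expressed as a small function. The following is a minimal Python sketch, not MeshStor code. It assumes the volume status has already been loaded into a dict, and the key names (`mdState`, `totalDevices`, `activeDevices`, `failedDevices`, `downDevices`) are hypothetical names chosen to mirror the columns above; adjust them to the actual resource schema. Time-based checks such as a stuck phase are deliberately left out.

```python
HEALTHY_MD_STATES = {"active", "clean"}


def volume_problems(status):
    """Return a list of problem descriptions for one volume status dict.

    The key names are assumptions that mirror the columns shown above.
    """
    problems = []
    md_state = status.get("mdState")
    if md_state not in HEALTHY_MD_STATES:
        problems.append(f"MDState is {md_state!r}")
    total = status.get("totalDevices", 0)
    active = status.get("activeDevices", 0)
    if active < total:
        problems.append(f"only {active} of {total} devices active")
    failed = status.get("failedDevices", 0)
    if failed > 0:
        problems.append(f"{failed} failed device(s)")
    down = status.get("downDevices", 0)
    if down > 0:
        problems.append(f"{down} down device(s)")
    return problems
```

An empty list means none of the instantaneous checks fired; any entry is a reason to look at the volume more closely.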
Detailed Volume Inspection
For a specific volume, inspect the partition-level status:
Key fields in .status.partitions[]:
| Field | Description |
|---|---|
| `nodeID` | Node hosting this partition |
| `state` | `Requested`, `Created`, `Syncing`, `Synced`, `Faulty`, `Missing`, `Replacing`, `Deleting` |
| `sizeBytes` | Partition size, in bytes |
| `updatedAt` | Last state change timestamp |
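To spot partitions that have been out of the `Synced` state for a while, the `state` and `updatedAt` fields can be combined. The sketch below is illustrative Python, assuming the `.status.partitions[]` list has been loaded as a list of dicts and that `updatedAt` is an RFC 3339 timestamp such as `2024-01-01T12:00:00Z`. The timestamp format is an assumption and is not confirmed by this page.

```python
from datetime import datetime, timezone


def unhealthy_partitions(partitions, now=None):
    """Return (nodeID, state, age_seconds) for every partition not in Synced.

    age_seconds is the time elapsed since the partition's last state change.
    """
    now = now or datetime.now(timezone.utc)
    result = []
    for part in partitions:
        if part["state"] == "Synced":
            continue
        # Assumption: RFC 3339 timestamp; a trailing "Z" is rewritten so that
        # datetime.fromisoformat accepts it on older Python versions.
        updated = datetime.fromisoformat(part["updatedAt"].replace("Z", "+00:00"))
        result.append((part["nodeID"], part["state"], (now - updated).total_seconds()))
    return result
```

Passing an explicit `now` makes the function easy to test; in normal use it can be omitted.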
Device Health
Monitor NVMe drive usage and free space via MeshStorNodeDevice (msnd):
NAME NODE DEVICE MODEL SIZE FREE VOLS AGE
node1-nvme0n1 node1 nvme0n1 Samsung 990 PRO 1.0TB 200.0GB 8 7d
node2-nvme0n1 node2 nvme0n1 Samsung 990 PRO 1.0TB 800.0GB 2 7d
What to Watch
| Condition | Action |
|---|---|
| `FREE` approaching 0 | Add drives or rebalance volumes. New volumes will fail to provision if no free space exists. |
| `VOLS` much higher on one node | Placement is skewed. Check whether node labels or network issues are limiting placement options. |
| Device disappears from `msnd` | The node may have lost the drive, or the node pod is not running. |
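The first two conditions can be checked mechanically from the device listing. The following Python sketch is an illustration only. It assumes each device is a dict with hypothetical keys `name`, `node`, `size`, `free`, and `vols`, that sizes are strings like `1.0TB` in decimal units, and that a node counts as skewed when its volume count exceeds 1.5 times the mean across nodes. The skew factor is an arbitrary choice, not a MeshStor setting.

```python
# Assumption: the listing uses decimal units (1 TB = 1000 GB).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a string such as '200.0GB' to a number of bytes."""
    number, unit = text[:-2], text[-2:].upper()
    return float(number) * UNITS[unit]


def free_fraction(size, free):
    """Return the free share of a device as a value between 0 and 1."""
    return parse_size(free) / parse_size(size)


def low_space_devices(devices, threshold=0.10):
    """Return names of devices whose free fraction is below the threshold."""
    return [d["name"] for d in devices
            if free_fraction(d["size"], d["free"]) < threshold]


def skewed_nodes(devices, factor=1.5):
    """Return nodes whose volume count exceeds factor times the per-node mean."""
    totals = {}
    for d in devices:
        totals[d["node"]] = totals.get(d["node"], 0) + d["vols"]
    if not totals:
        return []
    mean = sum(totals.values()) / len(totals)
    return sorted(node for node, vols in totals.items() if vols > factor * mean)
```

With the two example devices above, neither is low on space (20% and 80% free), while `node1` is reported as skewed because it hosts 8 of the 10 volumes.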
Volume Conditions
CSI volume conditions are reported to Kubernetes and visible in PV events:
MeshStor sets abnormal: true when the MD state is not active or clean, and includes details about failed or down devices.
Alerting Recommendations
Critical
- Any volume with `FailedDevices > 0`: a member has been permanently marked faulty
- Any volume stuck in `Replacing` for more than 30 minutes: replacement may be blocked
Warning
- Any volume with `DownDevices > 0`: a member is temporarily unreachable
- Any device with less than 10% free space: new volumes may fail to provision
- Any volume in `Syncing` phase for more than 1 hour: rebuild may be stalled
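The per-volume rules above map onto a simple severity function. This is a minimal Python sketch under stated assumptions: the caller supplies how long the volume has been in its current phase (`time_in_phase`), because this page does not say where that transition time is recorded. The per-device free-space warning is not covered, since it applies to devices rather than volumes.

```python
from datetime import timedelta

REPLACING_LIMIT = timedelta(minutes=30)  # critical threshold from the list above
SYNCING_LIMIT = timedelta(hours=1)       # warning threshold from the list above


def classify(failed, down, phase, time_in_phase):
    """Return 'critical', 'warning', or 'ok' for one volume.

    failed and down are device counts; time_in_phase is a timedelta that the
    caller must derive (how it is obtained is an assumption left open here).
    """
    if failed > 0 or (phase == "Replacing" and time_in_phase > REPLACING_LIMIT):
        return "critical"
    if down > 0 or (phase == "Syncing" and time_in_phase > SYNCING_LIMIT):
        return "warning"
    return "ok"
```

Critical conditions are checked first, so a volume that matches both a critical and a warning rule is reported as critical.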
Example: Watch All Volumes
Watching the msvol resources streams changes as they happen, which is useful for monitoring during maintenance windows.
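If the watch output is captured as JSON, phase transitions can be extracted from it programmatically. The sketch below is illustrative Python. It assumes the captured text is a sequence of JSON objects printed one after another, each with the standard `metadata.name` field and a `status.phase` field. The `status.phase` path and the capture method are assumptions, not something this page confirms.

```python
import json


def iter_json_objects(buffer):
    """Yield each value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(buffer):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos >= len(buffer):
            break
        obj, pos = decoder.raw_decode(buffer, pos)
        yield obj


def phase_changes(objects):
    """Yield (name, old_phase, new_phase) whenever a volume's phase changes."""
    last = {}
    for obj in objects:
        name = obj["metadata"]["name"]
        phase = obj.get("status", {}).get("phase")  # assumed field path
        if last.get(name) != phase:
            yield name, last.get(name), phase
            last[name] = phase
```

This handles a complete captured buffer only; consuming a live stream would additionally need to deal with partially received documents.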
What's Next
- Self-Healing — automatic recovery from node and network failures
- Volume Relocation — how volumes move between nodes
- Volume Expansion — grow a volume online without downtime
- Common Issues — troubleshoot problems you discover while monitoring