Monitoring¶
LLM-generated draft — not proofread
This page was drafted by an LLM and has not been reviewed by a human. Treat every claim as unverified until a maintainer signs off.
MeshStor exposes volume health through Kubernetes custom resources. This page explains what to watch and how to set up alerts.
Volume Health¶
The primary health indicator is the MeshStorVolume (`msvol`) custom resource, listed with `kubectl get msvol`:
NAME              PHASE    MDSTATE     READY  DEGRADED  SYNC   NODE      AGE
pvc-cd1038a7-...  Synced   active      2/2    0                mf-01-02  1h
pvc-7f3a91e2-...  Syncing  recovering  1/2    1         45.2%  mf-01-03  30m
pvc-1a2b3c4d-...  Synced   degraded    2/3    1                mf-01-04  2h
The Total, Active, Failed, and Missing device-counter columns are still available behind kubectl get msvol -o wide for deeper inspection.
Key Fields¶
| Field | Healthy Value | Alert When |
|---|---|---|
| Phase | `Synced` | Stuck in `Requested`, `Syncing`, `Expanding`, or `Replacing` for more than 10 minutes |
| MDState | `active` or `clean` | `degraded`, `recovering`, or `missing` |
| Ready | N/N (e.g. 2/2) | Active count below total: a member is missing or rebuilding |
| Degraded | 0 | Greater than 0: at least one member is faulty, missing, or out of sync |
| SyncPercentage | Empty (fully synced) | Present for extended periods (rebuild stalled) |
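The thresholds in the table above can be expressed as a small check. The following Python sketch is illustrative only: it assumes each volume has already been read into a dict keyed by lowercased column names (`phase`, `mdstate`, `ready`, `degraded`), and that the caller supplies a hypothetical `phase_age_minutes` value. None of these names are confirmed MeshStor API fields.

```python
# Illustrative sketch: the dict keys mirror the printed columns and are
# assumptions, not MeshStor API field names.
TRANSIENT_PHASES = {"Requested", "Syncing", "Expanding", "Replacing"}
HEALTHY_MD_STATES = {"active", "clean"}


def volume_findings(vol, phase_age_minutes=0.0, stuck_after_minutes=10.0):
    """Return a list of human-readable problems for one volume row."""
    findings = []

    # A transient phase is only a problem once it has lasted too long.
    phase = vol["phase"]
    if phase in TRANSIENT_PHASES and phase_age_minutes > stuck_after_minutes:
        findings.append(f"phase {phase} for {phase_age_minutes:g} min")

    if vol["mdstate"] not in HEALTHY_MD_STATES:
        findings.append(f"mdstate is {vol['mdstate']}")

    # READY is printed as "active/total", e.g. "2/2".
    active, total = (int(part) for part in vol["ready"].split("/"))
    if active < total:
        findings.append(f"only {active}/{total} members ready")

    if int(vol["degraded"]) > 0:
        findings.append(f"{vol['degraded']} degraded member(s)")

    return findings
```

An empty list means the row matches the healthy values; how the rows are collected (parsing `kubectl` output, a Kubernetes client, and so on) is left to the caller.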
Detailed Volume Inspection¶
For a specific volume, inspect the partition-level status with `kubectl get msvol <name> -o yaml`.
Key fields in .status.partitions[]:
| Field | Description |
|---|---|
| `nodeID` | Node hosting this partition |
| `state` | `Requested`, `Created`, `Syncing`, `Synced`, `Spare`, `Faulty`, `Missing`, `Replacing`, `Deleting` |
| `sizeBytes` | Partition size |
| `updatedAt` | Last state change timestamp |
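As a rough illustration of how the partition list might be consumed, the sketch below counts partitions by `state` and lists the nodes holding `Faulty` or `Missing` members. It assumes the volume has already been parsed into a Python dict (for example from JSON output) with the `.status.partitions[]` layout described above; the function name is hypothetical.

```python
from collections import Counter

# States from the table above that indicate a member needs attention.
PROBLEM_STATES = {"Faulty", "Missing"}


def summarize_partitions(volume):
    """Count partitions by state and list nodes with problem members.

    `volume` is one MeshStorVolume object parsed into a dict; only the
    `nodeID` and `state` fields of each partition are read here.
    """
    partitions = volume.get("status", {}).get("partitions", [])
    by_state = Counter(p["state"] for p in partitions)
    problem_nodes = sorted(
        p["nodeID"] for p in partitions if p["state"] in PROBLEM_STATES
    )
    return dict(by_state), problem_nodes
```

A volume with no `status` yet yields an empty summary rather than an error, which keeps the helper usable on freshly requested volumes.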
Device Health¶
Monitor NVMe drive usage and free space via MeshStorNodeDevice (`msnd`), listed with `kubectl get msnd`:
NAME MODEL SERIAL LOCALPARTITIONS REMOTEPARTITIONS UNKNOWNPARTITIONS MULTIQUEUE BIGGESTUSABLEFREESPACE SIZE SECTORSIZE UPDATEDAT
node1-nvme0n1 WD_BLACK SN7100 1TB 244... 1 1 0 20 931.5Gi 931.5Gi 4096 5s
node2-nvme1n1 WD_BLACK SN7100 500GB 260... 2 0 0 16 465.8Gi 465.8Gi 512 5s
LocalPartitions counts MeshStor partitions whose consumer pod runs on this node; RemotePartitions counts partitions that exist on this drive but back a volume mounted elsewhere; UnknownPartitions counts MeshStor-owned partitions whose MeshStorVolume CR is gone.
The CR name encodes the node and device (<node>-<device>); kubectl get msnd <name> -o yaml exposes the full .spec.node / .spec.device fields when the embedded form is not enough.
What to Watch¶
| Condition | Action |
|---|---|
| `BIGGESTUSABLEFREESPACE` approaching 0 | Add drives or rebalance volumes. New volumes will be placed only on other nodes, so pods scheduled here pay NVMe-oF latency on every I/O. |
| `LOCALPARTITIONS` much higher on one node | Consumer placement is skewed. Check whether scheduler hints or PVC affinity are pinning workloads to this node. |
| `REMOTEPARTITIONS` much higher on one node | Replica-target placement is skewed. Check if node labels or network issues are limiting placement options on the other nodes. |
| `UNKNOWNPARTITIONS` greater than 0 | The drive holds ex-MeshStor partitions whose volume CR is gone. Run the bundled `meshstor-cleanup` helper on the affected node (see Disk Cleanup). |
| `UPDATEDAT` falls more than ~60 s behind wall clock | The node's CSI driver has stopped reporting; placement and member replacement skip this node until the heartbeat returns. |
| Device disappears from `msnd` | Node may have lost the drive or the node pod is not running. |
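Two of the conditions above, orphaned partitions and a stale heartbeat, lend themselves to a mechanical check. The sketch below is a minimal illustration that assumes each device row is available as a dict keyed by lowercased column names (`unknownpartitions`, `updatedat`) and that `UPDATEDAT` is printed as a short age such as `5s` or `2m30s`, as in the sample output. Both helper names are hypothetical.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_seconds(age):
    """Convert an age such as '5s', '2m30s' or '1h' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized age: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def device_findings(dev, stale_after_seconds=60):
    """Return a list of problems for one device row."""
    findings = []
    if int(dev["unknownpartitions"]) > 0:
        findings.append("orphaned MeshStor partitions present")
    if age_seconds(dev["updatedat"]) > stale_after_seconds:
        findings.append("heartbeat is stale")
    return findings
```

The 60-second default follows the approximate threshold in the table; ages using units other than days, hours, minutes, and seconds raise `ValueError` instead of being guessed at.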
Volume Conditions¶
CSI volume conditions are reported to Kubernetes and visible in PV events:
MeshStor sets abnormal: true when the MD state is not active or clean, and includes details about failed or down devices.
Alerting Recommendations¶
Critical¶
- Any volume with `Failed > 0` (visible via `kubectl get msvol -o wide`, or as a member in `Faulty` state in `-o yaml`): a member has been permanently marked faulty
- Any volume stuck in `Replacing` for more than 30 minutes: replacement may be blocked
Warning¶
- Any volume with `Missing > 0` (visible via `kubectl get msvol -o wide`, or as a member in `Missing` state in `-o yaml`): a member is temporarily unreachable
- Any device with less than 10% free space: new volumes may fail to provision
- Any volume in `Syncing` phase for more than 1 hour: rebuild may be stalled
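The critical and warning rules above can be sketched as two small classifiers. This is an illustration under stated assumptions, not a shipped MeshStor component: the caller is assumed to supply the `Failed` and `Missing` counters, the current phase with its age in minutes, and device free and total sizes in bytes. How free space is measured is left to the caller, and all function and parameter names are hypothetical.

```python
def classify_volume(failed, missing, phase, phase_age_minutes):
    """Return 'critical', 'warning' or 'ok' for one volume."""
    # Critical rules are checked first so they win over warnings.
    if failed > 0:
        return "critical"
    if phase == "Replacing" and phase_age_minutes > 30:
        return "critical"
    if missing > 0:
        return "warning"
    if phase == "Syncing" and phase_age_minutes > 60:
        return "warning"
    return "ok"


def classify_device(free_bytes, size_bytes):
    """Return 'warning' when less than 10% of the device is free."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return "warning" if free_bytes / size_bytes < 0.10 else "ok"
```

Returning a single severity per object keeps the sketch easy to map onto an alerting system's labels; a real rule set would likely also report which condition fired.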
Example: Watch All Volumes¶
Watching the resource, for example with `kubectl get msvol --watch`, streams changes as they happen, which is useful for monitoring during maintenance windows.
What's Next¶
- Self-Healing — automatic recovery from node and network failures
- Volume Relocation — how volumes move between nodes
- Volume Expansion — grow a volume online without downtime