Monitoring

LLM-generated draft — not proofread

This page was drafted by an LLM and has not been reviewed by a human. Treat every claim as unverified until a maintainer signs off.

MeshStor exposes volume health through Kubernetes custom resources. This page explains what to watch and how to set up alerts.

Volume Health

The primary health indicator is the MeshStorVolume (msvol) custom resource:

kubectl get msvol

NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h
pvc-7f3a91e2-...   Syncing   recovering   1/2     1          45.2%   mf-01-03   30m
pvc-1a2b3c4d-...   Synced    degraded     2/3     1                  mf-01-04   2h

The Total, Active, Failed, and Missing device-counter columns remain available in the output of kubectl get msvol -o wide for deeper inspection.

Key Fields

Field	Healthy Value	Alert When
`Phase`	`Synced`	Stuck in `Requested`, `Syncing`, `Expanding`, or `Replacing` for more than 10 minutes
`MDState`	`active` or `clean`	`degraded`, `recovering`, or missing
`Ready`	`N/N` (e.g. `2/2`)	Active count below total — a member is missing or rebuilding
`Degraded`	`0`	Greater than `0` — at least one member is faulty, missing, or out of sync
`SyncPercentage`	Empty (fully synced)	Present for extended periods (rebuild stalled)
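
The checks in the table above can be approximated with a small filter over the default kubectl get msvol output. This is a minimal sketch, not part of MeshStor: the function name unhealthy_volumes is hypothetical, and it assumes the first five columns (NAME, PHASE, MDSTATE, READY, DEGRADED) are always populated, as in the sample output shown earlier. Time-based conditions such as a stuck phase are not covered.

```shell
# Reads `kubectl get msvol --no-headers` output on stdin and prints one line
# per volume that deviates from the healthy values in the Key Fields table.
# Only columns 1-5 are used, because the SYNC column can be empty and would
# shift every column after it.
unhealthy_volumes() {
  awk '
    {
      name = $1; mdstate = $3; ready = $4; degraded = $5
      split(ready, r, "/")   # "1/2" -> r[1] = active members, r[2] = total
      reason = ""
      if (mdstate != "active" && mdstate != "clean") reason = reason " mdstate=" mdstate
      if (r[1] + 0 < r[2] + 0) reason = reason " ready=" ready
      if (degraded + 0 > 0) reason = reason " degraded=" degraded
      if (reason != "") print name ":" reason
    }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get msvol --no-headers | unhealthy_volumes
```

A healthy volume produces no output, so an empty result can be treated as "nothing to alert on".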

Detailed Volume Inspection

For a specific volume, inspect the partition-level status:

kubectl get msvol <volume-name> -o yaml

Key fields in .status.partitions[]:

Field	Description
`nodeID`	Node hosting this partition
`state`	`Requested`, `Created`, `Syncing`, `Synced`, `Spare`, `Faulty`, `Missing`, `Replacing`, `Deleting`
`sizeBytes`	Partition size
`updatedAt`	Last state change timestamp
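
To avoid reading the full YAML, the partition fields listed above can be pulled out with a kubectl JSONPath template. The sketch below relies only on the .status.partitions[], nodeID, and state names from the table; the function names partition_states and problem_partitions are hypothetical helpers, not MeshStor commands.

```shell
# Prints "nodeID<TAB>state" for every partition of one volume.
# $1 is the MeshStorVolume name.
partition_states() {
  kubectl get msvol "$1" \
    -o jsonpath='{range .status.partitions[*]}{.nodeID}{"\t"}{.state}{"\n"}{end}'
}

# Reads "nodeID<TAB>state" lines on stdin and keeps only the partitions
# that are in the Faulty or Missing state.
problem_partitions() {
  awk -F '\t' '$2 == "Faulty" || $2 == "Missing" { print $1 ": " $2 }'
}

# Usage (assumes kubectl access to the cluster):
#   partition_states <volume-name> | problem_partitions
```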

Device Health

Monitor NVMe drive usage and free space via MeshStorNodeDevice (msnd):

kubectl get msnd

NAME            MODEL                   SERIAL   LOCALPARTITIONS   REMOTEPARTITIONS   UNKNOWNPARTITIONS   MULTIQUEUE   BIGGESTUSABLEFREESPACE   SIZE      SECTORSIZE   UPDATEDAT
node1-nvme0n1   WD_BLACK SN7100 1TB     244...   1                 1                  0                   20           931.5Gi                  931.5Gi   4096         5s
node2-nvme1n1   WD_BLACK SN7100 500GB   260...   2                 0                  0                   16           465.8Gi                  465.8Gi   512          5s

LocalPartitions counts MeshStor partitions whose consumer pod runs on this node; RemotePartitions counts partitions that exist on this drive but back a volume mounted elsewhere; UnknownPartitions counts MeshStor-owned partitions whose MeshStorVolume CR is gone.

The CR name encodes the node and device (<node>-<device>); kubectl get msnd <name> -o yaml exposes the full .spec.node / .spec.device fields when the embedded form is not enough.

What to Watch

Condition	Action
`BIGGESTUSABLEFREESPACE` approaching 0	Add drives or rebalance volumes. New volumes will be placed only on other nodes, so pods scheduled here pay NVMe-oF latency on every I/O.
`LOCALPARTITIONS` much higher on one node	Consumer placement is skewed — check whether scheduler hints or PVC affinity are pinning workloads to this node.
`REMOTEPARTITIONS` much higher on one node	Replica-target placement is skewed — check if node labels or network issues are limiting placement options on the other nodes.
`UNKNOWNPARTITIONS` greater than 0	The drive holds orphaned MeshStor partitions whose MeshStorVolume CR is gone. Run the bundled `meshstor-cleanup` helper on the affected node (see Disk Cleanup).
`UPDATEDAT` falls more than ~60 s behind wall clock	The node's CSI driver has stopped reporting; placement and member replacement skip this node until the heartbeat returns.
Device disappears from `msnd`	The node may have lost the drive, or the node pod is not running.
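
Two of the conditions above, orphaned partitions and a stale heartbeat, can be checked mechanically from the kubectl get msnd table. The following is a sketch under stated assumptions: msnd_warnings is a hypothetical helper, UPDATEDAT is assumed to print as a kubectl-style age (such as 5s or 2m30s, as in the sample output), and columns are counted from the right because MODEL can contain spaces. It would misread a row in which any of the last six columns is empty.

```shell
# Reads `kubectl get msnd --no-headers` output on stdin.
# $1 is the maximum acceptable UPDATEDAT age in seconds (default 60).
# Columns are addressed from the right: $NF is UPDATEDAT and
# $(NF - 5) is UNKNOWNPARTITIONS.
msnd_warnings() {
  awk -v max_age="${1:-60}" '
    # Converts an age such as "5s", "2m30s", or "1h" to seconds.
    function age_seconds(s,    total, n, u) {
      total = 0
      while (match(s, /^[0-9]+[smhd]/)) {
        n = substr(s, 1, RLENGTH - 1) + 0
        u = substr(s, RLENGTH, 1)
        if (u == "s") total += n
        else if (u == "m") total += n * 60
        else if (u == "h") total += n * 3600
        else total += n * 86400
        s = substr(s, RLENGTH + 1)
      }
      return total
    }
    NF >= 8 {
      unknown = $(NF - 5) + 0
      age = age_seconds($NF)
      if (unknown > 0) print $1 ": " unknown " unknown partition(s)"
      if (age > max_age) print $1 ": last update " age "s ago"
    }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get msnd --no-headers | msnd_warnings 60
```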

Volume Conditions

CSI volume conditions are reported to Kubernetes and visible in PV events:

kubectl describe pv <pv-name>

MeshStor sets abnormal: true when the MD state is neither active nor clean, and includes details about failed or down devices.

Alerting Recommendations

Critical

Any volume with Failed > 0 (visible via kubectl get msvol -o wide, or as a member in Faulty state in -o yaml) — a member has been permanently marked faulty
Any volume stuck in Replacing for more than 30 minutes — replacement may be blocked

Warning

Any volume with Missing > 0 (visible via kubectl get msvol -o wide, or as a member in Missing state in -o yaml) — a member is temporarily unreachable
Any device with less than 10% free space — new volumes may fail to provision
Any volume in Syncing phase for more than 1 hour — rebuild may be stalled
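
The free-space warning can be sketched as a filter over kubectl get msnd output. This example treats BIGGESTUSABLEFREESPACE divided by SIZE as a proxy for the device's free-space percentage, which is an assumption: the largest usable free region may be smaller than the total free space. The low_free_space name is hypothetical, only binary suffixes (Ki, Mi, Gi, Ti) are handled, and columns are counted from the right because MODEL can contain spaces.

```shell
# Reads `kubectl get msnd --no-headers` output on stdin and prints devices
# whose BIGGESTUSABLEFREESPACE is below a percentage of SIZE.
# $1 is the threshold in percent (default 10).
# $(NF - 3) is BIGGESTUSABLEFREESPACE and $(NF - 2) is SIZE.
low_free_space() {
  awk -v threshold="${1:-10}" '
    # Converts a quantity such as "931.5Gi" to bytes.
    function to_bytes(q,    n, unit) {
      n = q + 0                     # awk reads the leading number
      unit = q
      sub(/^[0-9.]+/, "", unit)     # what remains is the suffix
      if (unit == "Ki") return n * 1024
      if (unit == "Mi") return n * 1024 ^ 2
      if (unit == "Gi") return n * 1024 ^ 3
      if (unit == "Ti") return n * 1024 ^ 4
      return n
    }
    NF >= 8 {
      free = to_bytes($(NF - 3))
      size = to_bytes($(NF - 2))
      if (size > 0 && free / size * 100 < threshold)
        printf "%s: %.1f%% free\n", $1, free / size * 100
    }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get msnd --no-headers | low_free_space 10
```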

Example: Watch All Volumes

kubectl get msvol -w

This streams changes as they happen, which is useful for monitoring during maintenance windows.

What's Next

Self-Healing — automatic recovery from node and network failures
Volume Relocation — how volumes move between nodes
Volume Expansion — grow a volume online without downtime