Monitoring

LLM-generated draft — not proofread

This page was drafted by an LLM and has not been reviewed by a human. Treat every claim as unverified until a maintainer signs off.

MeshStor exposes volume health through Kubernetes custom resources. This page explains what to watch and how to set up alerts.

Volume Health

The primary health indicator is the MeshStorVolume (msvol) custom resource:

kubectl get msvol
NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h
pvc-7f3a91e2-...   Syncing   recovering   1/2     1          45.2%   mf-01-03   30m
pvc-1a2b3c4d-...   Synced    degraded     2/3     1                  mf-01-04   2h

The default view omits the Total, Active, Failed, and Missing device counters. Use kubectl get msvol -o wide to display them for deeper inspection.

Key Fields

| Field | Healthy Value | Alert When |
| --- | --- | --- |
| Phase | Synced | Stuck in Requested, Syncing, Expanding, or Replacing for more than 10 minutes |
| MDState | active or clean | degraded, recovering, or missing |
| Ready | N/N (e.g. 2/2) | Active count is below total: a member is missing or rebuilding |
| Degraded | 0 | Greater than 0: at least one member is faulty, missing, or out of sync |
| SyncPercentage (SYNC column) | Empty (fully synced) | Present for extended periods (rebuild stalled) |
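
The rules in the table can be applied to a snapshot of the default listing. The following is a minimal Python sketch, not part of MeshStor. It assumes the column layout shown in the sample output above and splits each row at the header offsets, so an empty SYNC cell does not shift the later columns. The names parse_table and volume_problems are hypothetical. A single snapshot cannot tell how long a volume has been in a phase, so the sketch flags every non-healthy value and leaves the 10-minute threshold to the caller.

```python
import re

# Sample text copied from the listing above; in practice this would be the
# captured stdout of `kubectl get msvol`.
SAMPLE = """\
NAME               PHASE     MDSTATE      READY   DEGRADED   SYNC    NODE       AGE
pvc-cd1038a7-...   Synced    active       2/2     0                  mf-01-02   1h
pvc-7f3a91e2-...   Syncing   recovering   1/2     1          45.2%   mf-01-03   30m
pvc-1a2b3c4d-...   Synced    degraded     2/3     1                  mf-01-04   2h
"""


def parse_table(text):
    """Split column-aligned kubectl output into one dict per row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    names = header.split()
    # Each column starts at the offset where its header word starts.
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, name in enumerate(names):
            end = starts[i + 1] if i + 1 < len(starts) else None
            row[name] = ln[starts[i]:end].strip()
        rows.append(row)
    return rows


def volume_problems(row):
    """Return the Key Fields values that are outside their healthy range."""
    problems = []
    if row["PHASE"] != "Synced":
        problems.append(f"phase={row['PHASE']}")
    if row["MDSTATE"] not in ("active", "clean"):
        problems.append(f"mdstate={row['MDSTATE']}")
    ready, total = (int(x) for x in row["READY"].split("/"))
    if ready < total:
        problems.append(f"ready={row['READY']}")
    if int(row["DEGRADED"] or 0) > 0:
        problems.append(f"degraded={row['DEGRADED']}")
    if row["SYNC"]:
        problems.append(f"sync={row['SYNC']}")
    return problems


if __name__ == "__main__":
    for row in parse_table(SAMPLE):
        print(row["NAME"], volume_problems(row) or "healthy")
```

Offset-based splitting relies on kubectl padding every column to a common width within one invocation; it is not suitable for the -w stream, where widths can change between rows.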

Detailed Volume Inspection

For a specific volume, inspect the partition-level status:

kubectl get msvol <volume-name> -o yaml

Key fields in .status.partitions[]:

| Field | Description |
| --- | --- |
| nodeID | Node hosting this partition |
| state | One of Requested, Created, Syncing, Synced, Spare, Faulty, Missing, Replacing, Deleting |
| sizeBytes | Partition size |
| updatedAt | Last state change timestamp |
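
To pick out the partitions that need attention, the same object can be read as JSON (kubectl get msvol <volume-name> -o json) and filtered on state. The sketch below is illustrative only: the field names come from the table above, but the sample document, its values, and the helper name unsynced_partitions are made up for the example, and treating every state other than Synced as noteworthy is an assumption.

```python
import json

# Hypothetical, trimmed-down object; only the fields listed in the table
# above are used, and all values are invented for illustration.
SAMPLE_JSON = """
{
  "status": {
    "partitions": [
      {"nodeID": "mf-01-02", "state": "Synced",  "updatedAt": "2024-01-01T00:00:00Z"},
      {"nodeID": "mf-01-03", "state": "Syncing", "updatedAt": "2024-01-01T00:05:00Z"},
      {"nodeID": "mf-01-04", "state": "Faulty",  "updatedAt": "2024-01-01T00:07:00Z"}
    ]
  }
}
"""


def unsynced_partitions(msvol):
    """Return (nodeID, state, updatedAt) for every partition not in Synced."""
    partitions = msvol.get("status", {}).get("partitions") or []
    return [
        (p.get("nodeID"), p.get("state"), p.get("updatedAt"))
        for p in partitions
        if p.get("state") != "Synced"
    ]


if __name__ == "__main__":
    for node, state, updated in unsynced_partitions(json.loads(SAMPLE_JSON)):
        print(f"{node}: {state} since {updated}")
```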

Device Health

Monitor NVMe drive usage and free space via MeshStorNodeDevice (msnd):

kubectl get msnd
NAME            MODEL                   SERIAL   LOCALPARTITIONS   REMOTEPARTITIONS   UNKNOWNPARTITIONS   MULTIQUEUE   BIGGESTUSABLEFREESPACE   SIZE      SECTORSIZE   UPDATEDAT
node1-nvme0n1   WD_BLACK SN7100 1TB     244...   1                 1                  0                   20           931.5Gi                  931.5Gi   4096         5s
node2-nvme1n1   WD_BLACK SN7100 500GB   260...   2                 0                  0                   16           465.8Gi                  465.8Gi   512          5s

LocalPartitions counts MeshStor partitions whose consumer pod runs on this node; RemotePartitions counts partitions that exist on this drive but back a volume mounted elsewhere; UnknownPartitions counts MeshStor-owned partitions whose MeshStorVolume CR is gone.

The CR name encodes the node and device (<node>-<device>); kubectl get msnd <name> -o yaml exposes the full .spec.node / .spec.device fields when the embedded form is not enough.

What to Watch

| Condition | Action |
| --- | --- |
| BIGGESTUSABLEFREESPACE approaching 0 | Add drives or rebalance volumes. New volumes will be placed only on other nodes, so pods scheduled here pay NVMe-oF latency on every I/O. |
| LOCALPARTITIONS much higher on one node | Consumer placement is skewed. Check whether scheduler hints or PVC affinity are pinning workloads to this node. |
| REMOTEPARTITIONS much higher on one node | Replica-target placement is skewed. Check whether node labels or network issues are limiting placement options on the other nodes. |
| UNKNOWNPARTITIONS greater than 0 | The drive holds orphaned MeshStor partitions whose volume CR is gone. Run the bundled meshstor-cleanup helper on the affected node (see Disk Cleanup). |
| UPDATEDAT falls more than ~60 s behind wall clock | The node's CSI driver has stopped reporting; placement and member replacement skip this node until the heartbeat returns. |
| Device disappears from msnd | The node may have lost the drive, or the node pod is not running. |
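
Several of these conditions can be checked mechanically once a row of kubectl get msnd has been split into columns. The sketch below is an illustration under stated assumptions: UPDATEDAT is rendered as a kubectl-style age such as 5s or 2m30s (as in the sample output), sizes use binary suffixes such as Gi, and the helper names and the 10% free-space ratio are hypothetical choices. BIGGESTUSABLEFREESPACE is the largest usable free region, so its ratio to SIZE is only an approximation of free space.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}
_BINARY_SUFFIX = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def age_to_seconds(age):
    """Convert an age such as '5s' or '2m30s' to seconds."""
    parts = re.findall(r"(\d+)([smhdy])", age)
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def quantity_to_bytes(quantity):
    """Convert a size such as '931.5Gi' to a number of bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?", quantity)
    if not m:
        raise ValueError(f"unrecognised quantity: {quantity!r}")
    return float(m.group(1)) * _BINARY_SUFFIX.get(m.group(2), 1)


def device_warnings(row, stale_after=60, min_free_ratio=0.10):
    """Return warnings for one msnd row given as a dict of column values."""
    warnings = []
    if int(row["UNKNOWNPARTITIONS"]) > 0:
        warnings.append("orphaned partitions present")
    if age_to_seconds(row["UPDATEDAT"]) > stale_after:
        warnings.append("heartbeat stale")
    free = quantity_to_bytes(row["BIGGESTUSABLEFREESPACE"])
    size = quantity_to_bytes(row["SIZE"])
    if size > 0 and free / size < min_free_ratio:
        warnings.append("low free space")
    return warnings


if __name__ == "__main__":
    row = {
        "NAME": "node1-nvme0n1",
        "UNKNOWNPARTITIONS": "0",
        "BIGGESTUSABLEFREESPACE": "931.5Gi",
        "SIZE": "931.5Gi",
        "UPDATEDAT": "5s",
    }
    print(row["NAME"], device_warnings(row) or "ok")
```

A device that has disappeared from the listing produces no row at all, so detecting that case requires comparing against a known inventory rather than inspecting rows.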

Volume Conditions

CSI volume conditions are reported to Kubernetes and visible in PV events:

kubectl describe pv <pv-name>

MeshStor sets abnormal: true when the MD state is neither active nor clean, and includes details about failed or down devices.

Alerting Recommendations

Critical

  • Any volume with Failed > 0 (visible via kubectl get msvol -o wide, or as a member in Faulty state in -o yaml) — a member has been permanently marked faulty
  • Any volume stuck in Replacing for more than 30 minutes — replacement may be blocked

Warning

  • Any volume with Missing > 0 (visible via kubectl get msvol -o wide, or as a member in Missing state in -o yaml) — a member is temporarily unreachable
  • Any device with less than 10% free space — new volumes may fail to provision
  • Any volume in Syncing phase for more than 1 hour — rebuild may be stalled
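
The volume-level recommendations above can be expressed as a small severity function. This is a sketch rather than a shipped alerting rule: VolumeSnapshot and classify are hypothetical names, and seconds_in_phase is assumed to be tracked by the caller, since this page does not describe a field that reports time spent in the current phase. The device free-space warning is left out because it applies to msnd objects, not volumes.

```python
from dataclasses import dataclass


@dataclass
class VolumeSnapshot:
    name: str
    phase: str               # e.g. "Synced", "Syncing", "Replacing"
    failed: int              # Failed counter from `kubectl get msvol -o wide`
    missing: int             # Missing counter from `kubectl get msvol -o wide`
    seconds_in_phase: float  # assumption: measured by the caller


def classify(volume):
    """Map a snapshot to 'critical', 'warning', or 'ok'."""
    # Critical: a member is permanently faulty, or replacement looks blocked.
    if volume.failed > 0:
        return "critical"
    if volume.phase == "Replacing" and volume.seconds_in_phase > 30 * 60:
        return "critical"
    # Warning: a member is temporarily unreachable, or a rebuild looks stalled.
    if volume.missing > 0:
        return "warning"
    if volume.phase == "Syncing" and volume.seconds_in_phase > 60 * 60:
        return "warning"
    return "ok"


if __name__ == "__main__":
    snapshot = VolumeSnapshot("pvc-example", "Syncing", 0, 0, 2 * 60 * 60)
    print(snapshot.name, classify(snapshot))
```

Critical checks run first so that a volume matching both tiers is reported at the higher severity.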

Example: Watch All Volumes

kubectl get msvol -w

This streams changes as they happen — useful for monitoring during maintenance windows.
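
The stream can also be used to estimate how long each volume has stayed in its current phase. The sketch below is one possible approach, not a MeshStor tool: PhaseTracker is a hypothetical helper that reads only the first two columns (NAME and PHASE) of each streamed line and remembers when the phase last changed. Because a stuck volume may emit no further lines, the elapsed time is checked separately with lingering(), which is meant to be called on a timer. Durations are measured from when the watcher first saw the phase, so they are a lower bound, and deleted volumes are not handled.

```python
import time


class PhaseTracker:
    """Track when each volume entered its current phase."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._since = {}  # volume name -> (phase, time the phase was first seen)

    def observe(self, line):
        """Record one line of `kubectl get msvol -w` output."""
        fields = line.split()
        if len(fields) < 2 or fields[0] == "NAME":
            return  # blank line or header
        name, phase = fields[0], fields[1]
        current = self._since.get(name)
        if current is None or current[0] != phase:
            self._since[name] = (phase, self._clock())

    def lingering(self, threshold_seconds):
        """Return (name, phase, elapsed) for non-Synced phases past the threshold."""
        now = self._clock()
        return sorted(
            (name, phase, now - since)
            for name, (phase, since) in self._since.items()
            if phase != "Synced" and now - since > threshold_seconds
        )
```

In use, each line read from the standard output of kubectl get msvol -w would be passed to observe(), and lingering(600) would be polled periodically to apply the 10-minute rule from Key Fields.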

What's Next