Internals

This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.

MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.

Single Binary, Two Roles

The same meshstor-csi binary serves both roles; the --role flag selects which role a given process runs.

Controller

Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.

Responsibilities:

  • CreateVolume / DeleteVolume — creates and deletes MeshStorVolume custom resources
  • Capacity reporting — aggregates free space from MeshStorNodeDevice CRs
  • Volume expansion — updates capacity in the volume CR and sets the phase to Expanding. The actual resize runs on the node side under NodeExpandVolume.

Sidecars:

  • csi-provisioner — watches PVCs and calls CreateVolume/DeleteVolume
  • csi-resizer — watches PVC resize requests and calls ControllerExpandVolume
  • livenessprobe — exposes the CSI Probe RPC on a health endpoint for kubelet liveness checks

Node

Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.

Responsibilities:

  • Partition management — creates, grows, and removes GPT partitions on local NVMe drives
  • NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
  • NVMe-oF import — connects to remote nodes and imports their partitions
  • MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
  • XFS formatting — formats new arrays and grows the filesystem during expansion
  • Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
  • NodeExpandVolume — drives in-place partition growth (or, when adjacent free space is missing, swaps in a freshly-provisioned replacement partition) and the subsequent mdadm --grow once every member has caught up
  • Device inventory — maintains MeshStorNodeDevice CRs with per-drive free space, volume counts, and a driver-alive heartbeat
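The NodeExpandVolume decision described above can be sketched as follows. This is only an illustration under stated assumptions: the Member type, the Action names, and both helper functions are hypothetical and are not taken from the MeshStor source. The sketch grows a member in place when the adjacent free space covers the shortfall, falls back to a replacement partition otherwise, and allows the array-level grow only once every member has reached the target size.

```go
package main

import "fmt"

// Member is a hypothetical view of one RAID member partition during expansion.
type Member struct {
	Name         string
	SizeBytes    int64 // current partition size
	AdjacentFree int64 // free bytes immediately after the partition on the same drive
}

// Action names are illustrative, not MeshStor identifiers.
type Action string

const (
	ActionNone    Action = "none"
	ActionGrow    Action = "grow-in-place"
	ActionReplace Action = "replace-partition"
)

// planMember decides how a single member can reach targetBytes.
func planMember(m Member, targetBytes int64) Action {
	switch {
	case m.SizeBytes >= targetBytes:
		return ActionNone // already large enough
	case m.AdjacentFree >= targetBytes-m.SizeBytes:
		return ActionGrow // the shortfall fits in the adjacent free space
	default:
		return ActionReplace // provision a fresh partition and swap it in
	}
}

// readyForArrayGrow reports whether every member has caught up to targetBytes,
// which is the precondition the text gives for running mdadm --grow.
func readyForArrayGrow(members []Member, targetBytes int64) bool {
	if len(members) == 0 {
		return false
	}
	for _, m := range members {
		if m.SizeBytes < targetBytes {
			return false
		}
	}
	return true
}

func main() {
	members := []Member{
		{Name: "local", SizeBytes: 10 << 30, AdjacentFree: 20 << 30},
		{Name: "remote", SizeBytes: 10 << 30, AdjacentFree: 1 << 30},
	}
	target := int64(20 << 30)
	for _, m := range members {
		fmt.Printf("%s: %s\n", m.Name, planMember(m, target))
	}
	fmt.Println("array grow allowed:", readyForArrayGrow(members, target))
}
```

Keeping the per-member decision separate from the array-level check mirrors the two stages in the text: members are resized independently, and the array grows only after the slowest one has finished.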

Sidecars:

  • node-driver-registrar — registers the CSI driver with the kubelet
  • livenessprobe — kubelet liveness gate; its periodic CSI Probe RPC is what drives the reconciliation loop below

NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node> — for example nqn.2025-12.io.meshstor:tcp:node1:node2 for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in nvme list-subsys output or under /sys/kernel/config/nvmet/subsystems/ when debugging.
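As a minimal sketch of the naming scheme above, the following Go helpers build and parse an NQN in the documented form. The function names BuildNQN and ParseNQN are hypothetical, and the parser assumes that transport and source node names contain no colons; neither is confirmed by the MeshStor source.

```go
package main

import (
	"fmt"
	"strings"
)

// nqnPrefix is the fixed prefix shown in the documented NQN format.
const nqnPrefix = "nqn.2025-12.io.meshstor"

// BuildNQN renders <prefix>:<transport>:<source-node>:<target-node>.
func BuildNQN(transport, sourceNode, targetNode string) string {
	return strings.Join([]string{nqnPrefix, transport, sourceNode, targetNode}, ":")
}

// ParseNQN splits an NQN back into its parts.
// Assumption: transport and source node names never contain ':'.
func ParseNQN(nqn string) (transport, sourceNode, targetNode string, err error) {
	if !strings.HasPrefix(nqn, nqnPrefix+":") {
		return "", "", "", fmt.Errorf("not a MeshStor NQN: %q", nqn)
	}
	parts := strings.SplitN(strings.TrimPrefix(nqn, nqnPrefix+":"), ":", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return "", "", "", fmt.Errorf("malformed MeshStor NQN: %q", nqn)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	nqn := BuildNQN("tcp", "node1", "node2")
	fmt.Println(nqn) // prints nqn.2025-12.io.meshstor:tcp:node1:node2

	transport, src, dst, err := ParseNQN(nqn)
	fmt.Println(transport, src, dst, err)
}
```

Because the name is a pure function of transport and node identities, either side can compute the expected subsystem name without consulting any shared state.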

Reconciliation Loop

The node plugin runs a reconciliation loop that the liveness probe both triggers and monitors: a cycle starts with each CSI Probe RPC, which arrives every 10 seconds. Each cycle is idempotent, so it can be interrupted and restarted safely.

Processing order:

  1. Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
  2. Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
  3. Export partitions — configure NVMe-oF target subsystems and namespaces
  4. Grow exported partitions — for volumes in the Expanding phase
  5. Process deleting partitions — remove partitions marked for deletion
  6. Process member replacements — swap faulty or missing members with new partitions
  7. Schedule replacements — mark Missing/Faulty members past memberMissingTimeout for replacement on healthy nodes
  8. Add imported partitions — reconnect remote partitions to their MD arrays
  9. Update device stats — refresh MeshStorNodeDevice CRs with current free space, volume counts, and the heartbeat used by other nodes' health checks
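The ordering above can be sketched as a fixed list of steps executed in sequence. This is an illustration only: the step type, runCycle, buildSteps, and the choice to stop at the first failing step are assumptions, not MeshStor's actual code. Because every step is meant to be idempotent, an aborted cycle is simply retried from the top on the next probe.

```go
package main

import (
	"errors"
	"fmt"
)

// errStepFailed is a placeholder error used only by this sketch.
var errStepFailed = errors.New("step failed")

// step pairs a human-readable name with the work it performs.
type step struct {
	name string
	run  func() error
}

// stepNames mirrors the documented processing order.
var stepNames = []string{
	"stop foreign MD devices",
	"collect partitions for export",
	"export partitions",
	"grow exported partitions",
	"process deleting partitions",
	"process member replacements",
	"schedule replacements",
	"add imported partitions",
	"update device stats",
}

// buildSteps wires every named step to the supplied handler (hypothetical helper).
func buildSteps(handler func(name string) error) []step {
	steps := make([]step, 0, len(stepNames))
	for _, n := range stepNames {
		name := n // capture a per-iteration copy for the closure
		steps = append(steps, step{name: name, run: func() error { return handler(name) }})
	}
	return steps
}

// runCycle runs steps in order and stops at the first error (an assumption).
// It returns the names of the steps that completed.
func runCycle(steps []step) ([]string, error) {
	var completed []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return completed, fmt.Errorf("%s: %w", s.name, err)
		}
		completed = append(completed, s.name)
	}
	return completed, nil
}

func main() {
	// A cycle in which the third step fails; the next probe would retry from step 1.
	steps := buildSteps(func(name string) error {
		if name == "export partitions" {
			return errStepFailed
		}
		return nil
	})
	done, err := runCycle(steps)
	fmt.Println("completed:", len(done), "error:", err)
}
```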

Custom Resources

MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators should avoid editing or deleting them.

MeshStorVolume (msvol)

Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.

It is also the coordination channel between node plugins — there is no direct RPC between nodes. When the consumer node needs a replica partition on a remote node, it records the request in the volume's .spec.partitions[]. The remote node's reconciler picks the request up, creates the partition, configures the NVMe-oF target subsystem and namespace, and writes the resulting state back to .status.partitions[]. The same CR drives member replacement, drain migration, and expansion: each side observes the CR, performs its local action, and updates its slice of the status. Everything the driver does across nodes is expressed as writes to this single resource.
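The request and response pattern described above can be sketched with plain Go structs. Everything here is hypothetical: the field names, the "Exported" state string, and the rule of at most one partition per node per volume are assumptions made for illustration and do not reflect the real MeshStorVolume schema. The point is only that the remote node acts on spec entries addressed to it and records the result in status, so a repeated pass does nothing.

```go
package main

import "fmt"

// PartitionRequest is a hypothetical entry of .spec.partitions[].
type PartitionRequest struct {
	Node      string // node asked to host the partition
	SizeBytes int64
}

// PartitionStatus is a hypothetical entry of .status.partitions[].
type PartitionStatus struct {
	Node  string
	State string // state names are assumptions
}

// Volume is a simplified stand-in for a MeshStorVolume CR.
type Volume struct {
	Name   string
	Spec   []PartitionRequest
	Status []PartitionStatus
}

// pendingFor returns the requests addressed to node that have no status yet.
func pendingFor(v Volume, node string) []PartitionRequest {
	for _, s := range v.Status {
		if s.Node == node {
			return nil // this node has already reported
		}
	}
	var out []PartitionRequest
	for _, r := range v.Spec {
		if r.Node == node {
			out = append(out, r)
		}
	}
	return out
}

// reconcileLocal performs the node-local action for each pending request and
// writes the result into the node's slice of the status. It returns the
// number of requests handled; a second call returns 0.
func reconcileLocal(v *Volume, node string) int {
	pending := pendingFor(*v, node)
	for range pending {
		// A real driver would create the partition and export it here.
		v.Status = append(v.Status, PartitionStatus{Node: node, State: "Exported"})
	}
	return len(pending)
}

func main() {
	v := Volume{
		Name: "pvc-example",
		Spec: []PartitionRequest{{Node: "node1", SizeBytes: 1 << 30}, {Node: "node2", SizeBytes: 1 << 30}},
	}
	fmt.Println("first pass handled:", reconcileLocal(&v, "node2"))
	fmt.Println("second pass handled:", reconcileLocal(&v, "node2"))
}
```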

kubectl get msvol
NAME               PHASE    MDSTATE   READY   DEGRADED   SYNC   NODE       AGE
pvc-cd1038a7-...   Synced   active    2/2     0                 mf-01-02   5m

MeshStorNodeDevice (msnd)

Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.

Node plugins read every other node's MeshStorNodeDevice CRs when deciding where to place replica partitions. Free space and per-node RDMA/TCP reachability surfaced here feed directly into the node scoring algorithm — the consumer node uses that information to pick the best remote candidates and record the placement in the MeshStorVolume CR described above. Without these CRs, a node plugin would have no visibility into the rest of the cluster's drives and scoring would degenerate to round-robin.
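To make the role of these inputs concrete, here is a minimal placement sketch. The actual scoring algorithm is not described on this page, so the ranking used here (drop unreachable or undersized candidates, prefer RDMA-reachable nodes, then prefer more free space) is an assumption, as are the Candidate type and the pickReplicas function.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical summary of one remote node derived from its
// MeshStorNodeDevice CRs.
type Candidate struct {
	Node      string
	FreeBytes int64 // largest usable free extent on the node
	RDMA      bool  // reachable over RDMA
	TCP       bool  // reachable over TCP
}

// pickReplicas returns up to n node names able to host needBytes,
// best candidate first. The ranking rules are illustrative assumptions.
func pickReplicas(cands []Candidate, needBytes int64, n int) []string {
	if n < 0 {
		n = 0
	}
	eligible := make([]Candidate, 0, len(cands))
	for _, c := range cands {
		if c.FreeBytes >= needBytes && (c.RDMA || c.TCP) {
			eligible = append(eligible, c)
		}
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		if eligible[i].RDMA != eligible[j].RDMA {
			return eligible[i].RDMA // RDMA-capable nodes rank first
		}
		return eligible[i].FreeBytes > eligible[j].FreeBytes
	})
	if len(eligible) > n {
		eligible = eligible[:n]
	}
	names := make([]string, 0, len(eligible))
	for _, c := range eligible {
		names = append(names, c.Node)
	}
	return names
}

func main() {
	cands := []Candidate{
		{Node: "a", FreeBytes: 100, TCP: true},
		{Node: "b", FreeBytes: 50, RDMA: true, TCP: true},
		{Node: "c", FreeBytes: 10, RDMA: true},
		{Node: "d", FreeBytes: 200}, // unreachable, so never chosen
	}
	fmt.Println(pickReplicas(cands, 40, 2))
}
```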

kubectl get msnd
NAME            MODEL                   SERIAL   LOCALPARTITIONS   REMOTEPARTITIONS   UNKNOWNPARTITIONS   MULTIQUEUE   BIGGESTUSABLEFREESPACE   SIZE      SECTORSIZE   UPDATEDAT
node1-nvme0n1   WD_BLACK SN7100 1TB     244...   1                 1                  0                   20           931.5Gi                  931.5Gi   4096         5s
node2-nvme1n1   WD_BLACK SN7100 500GB   260...   2                 0                  0                   16           465.8Gi                  465.8Gi   512          5s

Privilege Requirements

The node plugin runs as a privileged container because it needs host-level access for:

  • GPT partition creation and removal (/dev/ access)
  • NVMe-oF target configuration (/sys/kernel/config/nvmet)
  • MD RAID management (mdadm)
  • Device probing (udevadm, partprobe)
  • Mount operations (bind mounts for pod volumes)

The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.

What's Next