Internals¶
This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.
MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.
Single Binary, Two Roles¶
The same meshstor-csi binary serves both roles. Role is set with the --role flag.
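A minimal sketch of how one binary might dispatch on `--role`, using Go's standard `flag` package. The accepted values `controller` and `node` are assumptions for illustration; the page names the two roles but does not list the strings the flag takes.

```go
package main

import (
	"flag"
	"fmt"
)

// Accepted role values are assumptions for illustration; the page names the
// two roles but does not spell out the strings the flag accepts.
const (
	roleController = "controller"
	roleNode       = "node"
)

// parseRole extracts --role from args and rejects anything unexpected.
func parseRole(args []string) (string, error) {
	fs := flag.NewFlagSet("meshstor-csi", flag.ContinueOnError)
	role := fs.String("role", "", "driver role: controller or node")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *role {
	case roleController, roleNode:
		return *role, nil
	default:
		return "", fmt.Errorf("unknown role %q", *role)
	}
}

func main() {
	// A real binary would pass os.Args[1:]; fixed arguments keep the sketch deterministic.
	role, err := parseRole([]string{"--role=node"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("starting in role:", role)
}
```

Rejecting unknown values up front means a misconfigured workload fails at startup rather than running with neither role's behaviour.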
Controller¶
Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.
Responsibilities:
- CreateVolume / DeleteVolume — creates and deletes MeshStorVolume custom resources
- Capacity reporting — aggregates free space from MeshStorNodeDevice CRs
- Volume expansion — updates capacity in the volume CR and sets the phase to Expanding. The actual resize runs on the node side under NodeExpandVolume.
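The expansion hand-off can be pictured as a pure metadata update. The sketch below is illustrative only: the `volume` struct is a hypothetical, trimmed stand-in for the MeshStorVolume CR, the `Ready` phase is invented for the example, and rejecting shrink requests is an assumption. Only the `Expanding` phase name comes from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase name taken from the text above.
const phaseExpanding = "Expanding"

// volume is a hypothetical, trimmed stand-in for the MeshStorVolume CR.
type volume struct {
	Name          string
	CapacityBytes int64
	Phase         string
}

var errShrink = errors.New("shrinking a volume is not supported")

// requestExpansion records the new size and flips the phase so the node side
// can pick the work up. It performs no resize itself.
func requestExpansion(v *volume, newBytes int64) error {
	switch {
	case newBytes < v.CapacityBytes:
		return errShrink // assumption: shrink requests are rejected
	case newBytes == v.CapacityBytes:
		return nil // repeated request for the same size: nothing to do
	}
	v.CapacityBytes = newBytes
	v.Phase = phaseExpanding
	return nil
}

func main() {
	// "Ready" is a made-up starting phase for the example.
	v := &volume{Name: "pvc-example", CapacityBytes: 10 << 30, Phase: "Ready"}
	if err := requestExpansion(v, 20<<30); err != nil {
		fmt.Println("expand failed:", err)
		return
	}
	fmt.Printf("%s: capacity=%d phase=%s\n", v.Name, v.CapacityBytes, v.Phase)
}
```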
Sidecars:
- csi-provisioner — watches PVCs and calls CreateVolume/DeleteVolume
- csi-resizer — watches PVC resize requests and calls ControllerExpandVolume
- livenessprobe — exposes the CSI Probe RPC on a health endpoint for kubelet liveness checks
Node¶
Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.
Responsibilities:
- Partition management — creates, grows, and removes GPT partitions on local NVMe drives
- NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
- NVMe-oF import — connects to remote nodes and imports their partitions
- MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
- XFS formatting — formats new arrays and grows the filesystem during expansion
- Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
- NodeExpandVolume — drives in-place partition growth (or, when adjacent free space is missing, swaps in a freshly-provisioned replacement partition) and the subsequent mdadm --grow once every member has caught up
- Device inventory — maintains MeshStorNodeDevice CRs with per-drive free space, volume counts, and a driver-alive heartbeat
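The NodeExpandVolume decision described above (grow in place when adjacent space allows, otherwise replace, and only grow the array once every member is ready) might be sketched as follows. The `member` type and its fields are hypothetical, and "caught up" is modelled here simply as the partition having reached the target size; the real driver may also wait for resync.

```go
package main

import "fmt"

// member is a hypothetical view of one RAID member partition.
type member struct {
	Node              string
	SizeBytes         int64 // current partition size
	AdjacentFreeBytes int64 // free space directly after the partition
}

type action string

const (
	actionNone    action = "none"
	actionGrow    action = "grow-in-place"
	actionReplace action = "replace"
)

// planMember picks the per-member action for reaching the target size.
func planMember(m member, target int64) action {
	switch {
	case m.SizeBytes >= target:
		return actionNone
	case m.SizeBytes+m.AdjacentFreeBytes >= target:
		return actionGrow
	default:
		return actionReplace
	}
}

// readyForArrayGrow reports whether every member has reached the target size,
// which this sketch treats as the precondition for mdadm --grow.
func readyForArrayGrow(members []member, target int64) bool {
	for _, m := range members {
		if m.SizeBytes < target {
			return false
		}
	}
	return true
}

func main() {
	target := int64(200 << 30)
	members := []member{
		{Node: "node1", SizeBytes: 100 << 30, AdjacentFreeBytes: 150 << 30},
		{Node: "node2", SizeBytes: 100 << 30, AdjacentFreeBytes: 10 << 30},
	}
	for _, m := range members {
		fmt.Printf("%s: %s\n", m.Node, planMember(m, target))
	}
	fmt.Println("ready for mdadm --grow:", readyForArrayGrow(members, target))
}
```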
Sidecars:
- node-driver-registrar — registers the CSI driver with the kubelet
- livenessprobe — kubelet liveness gate; its periodic CSI Probe RPC is what drives the reconciliation loop below
NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node> — for example nqn.2025-12.io.meshstor:tcp:node1:node2 for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in nvme list-subsys output or under /sys/kernel/config/nvmet/subsystems/ when debugging.
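Because the NQN format is deterministic, it can be built and parsed with plain string handling. The helpers below are a sketch, not the driver's code; the function names are hypothetical, and the parser assumes node names never contain a colon.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix taken from the NQN format given in the text.
const nqnPrefix = "nqn.2025-12.io.meshstor"

// buildNQN assembles nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node>.
func buildNQN(transport, source, target string) string {
	return strings.Join([]string{nqnPrefix, transport, source, target}, ":")
}

// parseNQN splits a MeshStor NQN back into its parts.
// Assumption: node names contain no ':' characters.
func parseNQN(nqn string) (transport, source, target string, err error) {
	if !strings.HasPrefix(nqn, nqnPrefix+":") {
		return "", "", "", fmt.Errorf("not a MeshStor NQN: %q", nqn)
	}
	parts := strings.Split(strings.TrimPrefix(nqn, nqnPrefix+":"), ":")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return "", "", "", fmt.Errorf("malformed MeshStor NQN: %q", nqn)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	nqn := buildNQN("tcp", "node1", "node2")
	fmt.Println(nqn) // nqn.2025-12.io.meshstor:tcp:node1:node2

	transport, source, target, err := parseNQN(nqn)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(transport, source, target) // tcp node1 node2
}
```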
Reconciliation Loop¶
The node plugin runs a reconciliation loop triggered and monitored by the liveness probe every 10 seconds. Each cycle is idempotent — it can be interrupted and restarted safely.
Processing order:
- Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
- Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
- Export partitions — configure NVMe-oF target subsystems and namespaces
- Grow exported partitions — for volumes in the Expanding phase
- Process deleting partitions — remove partitions marked for deletion
- Process member replacements — swap faulty or missing members with new partitions
- Schedule replacements — mark Missing/Faulty members past memberMissingTimeout for replacement on healthy nodes
- Add imported partitions — reconnect remote partitions to their MD arrays
- Update device stats — refresh MeshStorNodeDevice CRs with current free space, volume counts, and the heartbeat used by other nodes' health checks
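The processing order above can be modelled as a fixed list of steps executed in sequence. This is a sketch under assumptions: the `step` type and `runCycle` function are hypothetical, and stopping at the first failing step is a design choice made for the example, not something the page states. Since each cycle is idempotent, an aborted cycle can simply be retried on the next probe.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical named unit of reconciliation work.
type step struct {
	name string
	run  func() error
}

var errStepFailed = errors.New("step failed")

// runCycle executes steps in order and stops at the first failure.
// Steps are expected to be idempotent, so the next cycle can start over.
func runCycle(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	names := []string{
		"stop foreign MD devices",
		"collect partitions for export",
		"export partitions",
		"grow exported partitions",
		"process deleting partitions",
		"process member replacements",
		"schedule replacements",
		"add imported partitions",
		"update device stats",
	}
	steps := make([]step, 0, len(names))
	for _, n := range names {
		n := n // capture the per-iteration value
		steps = append(steps, step{name: n, run: func() error {
			fmt.Println("run:", n)
			return nil
		}})
	}
	fmt.Println("first cycle:", runCycle(steps))

	// A cycle whose second step fails: later steps are skipped.
	failing := []step{
		steps[0],
		{name: "export partitions", run: func() error { return errStepFailed }},
		steps[3],
	}
	fmt.Println("second cycle:", runCycle(failing))
}
```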
Custom Resources¶
MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators should avoid editing or deleting them.
MeshStorVolume (msvol)¶
Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.
It is also the coordination channel between node plugins — there is no direct RPC between nodes. When the consumer node needs a replica partition on a remote node, it records the request in the volume's .spec.partitions[]. The remote node's reconciler picks the request up, creates the partition, configures the NVMe-oF target subsystem and namespace, and writes the resulting state back to .status.partitions[]. The same CR drives member replacement, drain migration, and expansion: each side observes the CR, performs its local action, and updates its slice of the status. Everything the driver does across nodes is expressed as writes to this single resource.
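The request/response pattern through `.spec.partitions[]` and `.status.partitions[]` can be illustrated with an in-memory model. All type and field names below are hypothetical simplifications of the CR, the `Exported` state string is invented, and the sketch assumes at most one partition per node per volume. A real implementation would go through the Kubernetes API rather than mutate a struct.

```go
package main

import "fmt"

// Hypothetical, trimmed shapes of .spec.partitions[] and .status.partitions[].
type partitionRequest struct {
	Node      string
	SizeBytes int64
}

type partitionStatus struct {
	Node  string
	State string
}

type volumeSpec struct{ Partitions []partitionRequest }
type volumeStatus struct{ Partitions []partitionStatus }

type meshStorVolume struct {
	Name   string
	Spec   volumeSpec
	Status volumeStatus
}

// hasStatus reports whether node has already written its status slice.
func hasStatus(v *meshStorVolume, node string) bool {
	for _, st := range v.Status.Partitions {
		if st.Node == node {
			return true
		}
	}
	return false
}

// reconcileLocal handles requests addressed to localNode and returns how many
// it acted on. Running it again is a no-op, which keeps the cycle idempotent.
func reconcileLocal(v *meshStorVolume, localNode string) int {
	acted := 0
	for _, req := range v.Spec.Partitions {
		if req.Node != localNode || hasStatus(v, localNode) {
			continue
		}
		// Partition creation and NVMe-oF export would happen here.
		v.Status.Partitions = append(v.Status.Partitions,
			partitionStatus{Node: localNode, State: "Exported"})
		acted++
	}
	return acted
}

func main() {
	v := &meshStorVolume{Name: "pvc-example"}
	// The consumer node records a request for a replica on node2.
	v.Spec.Partitions = append(v.Spec.Partitions,
		partitionRequest{Node: "node2", SizeBytes: 10 << 30})

	fmt.Println("node2 first pass:", reconcileLocal(v, "node2"))
	fmt.Println("node2 second pass:", reconcileLocal(v, "node2"))
	fmt.Println("status:", v.Status.Partitions)
}
```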
MeshStorNodeDevice (msnd)¶
Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.
Node plugins read every other node's MeshStorNodeDevice CRs when deciding where to place replica partitions. Free space and per-node RDMA/TCP reachability surfaced here feed directly into the node scoring algorithm — the consumer node uses that information to pick the best remote candidates and record the placement in the MeshStorVolume CR described above. Without these CRs, a node plugin would have no visibility into the rest of the cluster's drives and scoring would degenerate to round-robin.
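The page does not specify the scoring algorithm, so the following is only an illustrative heuristic showing how free space and reachability could combine into a ranking. The `candidate` type, the `rdmaBonus` weight, and both functions are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical per-node summary derived from MeshStorNodeDevice CRs.
type candidate struct {
	Node      string
	FreeBytes int64
	RDMA      bool // reachable over RDMA
	TCP       bool // reachable over TCP
}

// rdmaBonus is an arbitrary weight chosen for the example.
const rdmaBonus = int64(1) << 30

// score returns -1 for ineligible nodes, otherwise headroom plus a transport bonus.
// It is an illustrative heuristic, not MeshStor's algorithm.
func score(c candidate, need int64) int64 {
	if c.FreeBytes < need || (!c.RDMA && !c.TCP) {
		return -1
	}
	s := c.FreeBytes - need
	if c.RDMA {
		s += rdmaBonus
	}
	return s
}

// pickReplicas returns up to n node names, best score first, ties broken by name.
func pickReplicas(cands []candidate, need int64, n int) []string {
	if n < 0 {
		n = 0
	}
	eligible := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if score(c, need) >= 0 {
			eligible = append(eligible, c)
		}
	}
	sort.Slice(eligible, func(i, j int) bool {
		si, sj := score(eligible[i], need), score(eligible[j], need)
		if si != sj {
			return si > sj
		}
		return eligible[i].Node < eligible[j].Node
	})
	if len(eligible) > n {
		eligible = eligible[:n]
	}
	picked := make([]string, 0, len(eligible))
	for _, c := range eligible {
		picked = append(picked, c.Node)
	}
	return picked
}

func main() {
	cands := []candidate{
		{Node: "node1", FreeBytes: 500 << 30, TCP: true},
		{Node: "node2", FreeBytes: 400 << 30, RDMA: true, TCP: true},
		{Node: "node3", FreeBytes: 50 << 30, TCP: true},
	}
	fmt.Println(pickReplicas(cands, 100<<30, 2))
}
```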
NAME            MODEL                   SERIAL   LOCALPARTITIONS   REMOTEPARTITIONS   UNKNOWNPARTITIONS   MULTIQUEUE   BIGGESTUSABLEFREESPACE   SIZE      SECTORSIZE   UPDATEDAT
node1-nvme0n1   WD_BLACK SN7100 1TB     244...   1                 1                  0                   20           931.5Gi                  931.5Gi   4096         5s
node2-nvme1n1   WD_BLACK SN7100 500GB   260...   2                 0                  0                   16           465.8Gi                  465.8Gi   512          5s
Privilege Requirements¶
The node plugin runs as a privileged container. It needs this for:
- GPT partition creation and removal (/dev/ access)
- NVMe-oF target configuration (/sys/kernel/config/nvmet)
- MD RAID management (mdadm)
- Device probing (udevadm, partprobe)
- Mount operations (bind mounts for pod volumes)
The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.
What's Next¶
- Replication — how MeshStor uses MD RAID across nodes
- Prerequisites — hardware and kernel requirements