Internals¶

This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.

MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.

Single Binary, Two Roles¶

The same meshstor-csi binary serves both roles. Role is set with the --role flag.
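
As a rough illustration of how one binary can dispatch on a role flag, the Go sketch below parses --role and validates it. The accepted values "controller" and "node" and the helper name parseRole are assumptions made for this example; the source only states that the role is chosen with --role.

```go
package main

import (
	"flag"
	"fmt"
)

// Assumed role values: the docs name the two roles but not the literal
// strings accepted by --role.
const (
	roleController = "controller"
	roleNode       = "node"
)

// parseRole (hypothetical helper) parses the argument list and returns the
// validated role, or an error when the flag is missing or unknown.
func parseRole(args []string) (string, error) {
	fs := flag.NewFlagSet("meshstor-csi", flag.ContinueOnError)
	role := fs.String("role", "", "driver role: controller or node")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *role {
	case roleController, roleNode:
		return *role, nil
	default:
		return "", fmt.Errorf("unknown role %q", *role)
	}
}

func main() {
	// Fixed example arguments keep the demo deterministic; a real binary
	// would pass os.Args[1:] instead.
	examples := [][]string{{"--role=controller"}, {"--role=node"}, {"--role=bogus"}}
	for _, args := range examples {
		role, err := parseRole(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("starting in role:", role)
	}
}
```

Validating the role up front means a misconfigured workload fails at startup rather than running with neither set of responsibilities.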

Controller¶

Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.

Responsibilities:

CreateVolume / DeleteVolume — creates and deletes MeshStorVolume custom resources
Capacity reporting — aggregates free space from MeshStorNodeDevice CRs
Volume expansion — updates capacity in the volume CR and sets the phase to Expanding. The actual resize runs on the node side under NodeExpandVolume.
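
The split between controller and node during expansion can be sketched as follows. The phase names Synced and Expanding come from this page; the Volume struct, its fields, and requestExpansion are hypothetical stand-ins for the real CR types and handler. Rejecting a shrink is an assumption that matches how CSI expansion generally behaves.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	PhaseSynced    = "Synced"
	PhaseExpanding = "Expanding"
)

// Volume is a hypothetical, trimmed-down view of a MeshStorVolume CR.
type Volume struct {
	Name          string
	CapacityBytes int64
	Phase         string
}

// ErrShrink is returned when the requested size is below the current one.
var ErrShrink = errors.New("requested size is smaller than current capacity")

// requestExpansion models the controller side only: it records the new
// capacity and sets the phase. The returned flag says whether the node side
// still has to perform the actual resize.
func requestExpansion(v *Volume, newBytes int64) (nodeExpansionRequired bool, err error) {
	if newBytes < v.CapacityBytes {
		return false, ErrShrink
	}
	if newBytes == v.CapacityBytes {
		// Repeated call with the same size: no change, but the node may
		// still be working on an earlier request.
		return v.Phase == PhaseExpanding, nil
	}
	v.CapacityBytes = newBytes
	v.Phase = PhaseExpanding
	return true, nil
}

func main() {
	v := &Volume{Name: "pvc-example", CapacityBytes: 10 << 30, Phase: PhaseSynced}
	need, err := requestExpansion(v, 20<<30)
	fmt.Println(need, err, v.Phase, v.CapacityBytes)
}
```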

Sidecars:

csi-provisioner — watches PVCs and calls CreateVolume/DeleteVolume
csi-resizer — watches PVC resize requests and calls ControllerExpandVolume
livenessprobe — exposes the CSI Probe RPC on a health endpoint for kubelet liveness checks

Node¶

Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.

Responsibilities:

Partition management — creates, grows, and removes GPT partitions on local NVMe drives
NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
NVMe-oF import — connects to remote nodes and imports their partitions
MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
XFS formatting — formats new arrays and grows the filesystem during expansion
Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
NodeExpandVolume — drives in-place partition growth (or, when adjacent free space is missing, swaps in a freshly-provisioned replacement partition) and the subsequent mdadm --grow once every member has caught up
Device inventory — maintains MeshStorNodeDevice CRs with per-drive free space, volume counts, and a driver-alive heartbeat
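
The NodeExpandVolume behaviour described in the list above amounts to a per-member decision followed by a gate on the array. The Go sketch below illustrates that decision under simplified assumptions; the Member type, the action names, and both functions are hypothetical and do not mirror the driver's actual code.

```go
package main

import "fmt"

// Member is a hypothetical view of one RAID member partition.
type Member struct {
	Name              string
	SizeBytes         int64
	AdjacentFreeBytes int64 // free space directly behind the partition
}

type Action string

const (
	ActionNone        Action = "none"
	ActionGrowInPlace Action = "grow-in-place"
	ActionReplace     Action = "replace"
)

// planMember chooses how a single member reaches the target size: grow in
// place when enough adjacent space exists, otherwise swap in a replacement.
func planMember(m Member, targetBytes int64) Action {
	missing := targetBytes - m.SizeBytes
	switch {
	case missing <= 0:
		return ActionNone
	case m.AdjacentFreeBytes >= missing:
		return ActionGrowInPlace
	default:
		return ActionReplace
	}
}

// readyToGrowArray reports whether every member has reached the target
// size, which is the precondition for running mdadm --grow.
func readyToGrowArray(members []Member, targetBytes int64) bool {
	for _, m := range members {
		if m.SizeBytes < targetBytes {
			return false
		}
	}
	return len(members) > 0
}

func main() {
	members := []Member{
		{Name: "local", SizeBytes: 100, AdjacentFreeBytes: 200},
		{Name: "remote", SizeBytes: 100, AdjacentFreeBytes: 10},
	}
	for _, m := range members {
		fmt.Println(m.Name, planMember(m, 150))
	}
	fmt.Println("array ready:", readyToGrowArray(members, 150))
}
```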

Sidecars:

node-driver-registrar — registers the CSI driver with the kubelet
livenessprobe — kubelet liveness gate; its periodic CSI Probe RPC is what drives the reconciliation loop below

NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node> — for example nqn.2025-12.io.meshstor:tcp:node1:node2 for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in nvme list-subsys output or under /sys/kernel/config/nvmet/subsystems/ when debugging.
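
Because the NQN is a pure function of transport and node names, it can be built and parsed with a few lines of Go. The sketch below follows the format given above; the function names are hypothetical, and the parser assumes node names contain no colon.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix taken from the NQN format documented above.
const nqnPrefix = "nqn.2025-12.io.meshstor"

// SubsystemNQN builds the NQN for a partition on sourceNode that is
// exported to targetNode over the given transport.
func SubsystemNQN(transport, sourceNode, targetNode string) string {
	return strings.Join([]string{nqnPrefix, transport, sourceNode, targetNode}, ":")
}

// ParseSubsystemNQN splits an NQN back into its parts. It returns ok=false
// for strings that do not match the documented four-field layout.
func ParseSubsystemNQN(nqn string) (transport, sourceNode, targetNode string, ok bool) {
	parts := strings.Split(nqn, ":")
	if len(parts) != 4 || parts[0] != nqnPrefix {
		return "", "", "", false
	}
	return parts[1], parts[2], parts[3], true
}

func main() {
	nqn := SubsystemNQN("tcp", "node1", "node2")
	fmt.Println(nqn)
	transport, src, dst, ok := ParseSubsystemNQN(nqn)
	fmt.Println(transport, src, dst, ok)
}
```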

Reconciliation Loop¶

The node plugin runs a reconciliation loop that the livenessprobe sidecar triggers through its CSI Probe RPC every 10 seconds. Because the same probe gates kubelet liveness, it also monitors the loop. Each cycle is idempotent, so it can be interrupted and restarted safely.

Processing order:

Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
Export partitions — configure NVMe-oF target subsystems and namespaces
Grow exported partitions — for volumes in the Expanding phase
Process deleting partitions — remove partitions marked for deletion
Process member replacements — swap faulty or missing members with new partitions
Schedule replacements — mark Missing/Faulty members past memberMissingTimeout for replacement on healthy nodes
Add imported partitions — reconnect remote partitions to their MD arrays
Update device stats — refresh MeshStorNodeDevice CRs with current free space, volume counts, and the heartbeat used by other nodes' health checks
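
The processing order above can be pictured as an ordered list of idempotent steps run once per probe. The Go sketch below is illustrative only: the Step type and runCycle are hypothetical, and stopping at the first failing step (leaving the rest for the next cycle) is an assumption, since the source does not say how errors are handled.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one named, idempotent unit of reconciliation work.
type Step struct {
	Name string
	Run  func() error
}

// errDemo is a placeholder failure used to demonstrate error handling.
var errDemo = errors.New("demo failure")

// runCycle executes the steps in order and stops at the first error.
// Because every step is idempotent, the next cycle can simply start over.
func runCycle(steps []Step) error {
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return fmt.Errorf("%s: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	noop := func() error { return nil }
	steps := []Step{
		{"stop foreign MD devices", noop},
		{"collect partitions for export", noop},
		{"export partitions", noop},
		{"grow exported partitions", noop},
		{"process deleting partitions", noop},
		{"process member replacements", noop},
		{"schedule replacements", noop},
		{"add imported partitions", noop},
		{"update device stats", func() error { return errDemo }},
	}
	if err := runCycle(steps); err != nil {
		fmt.Println("cycle aborted:", err)
		return
	}
	fmt.Println("cycle complete")
}
```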

Custom Resources¶

MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators should avoid editing or deleting them.

MeshStorVolume (`msvol`)¶

Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.

It is also the coordination channel between node plugins — there is no direct RPC between nodes. When the consumer node needs a replica partition on a remote node, it records the request in the volume's .spec.partitions[]. The remote node's reconciler picks the request up, creates the partition, configures the NVMe-oF target subsystem and namespace, and writes the resulting state back to .status.partitions[]. The same CR drives member replacement, drain migration, and expansion: each side observes the CR, performs its local action, and updates its slice of the status. Everything the driver does across nodes is expressed as writes to this single resource.
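
The request and response pattern over .spec.partitions[] and .status.partitions[] can be sketched in Go as below. All type names, field names, and the "Exported" state string are hypothetical; the real CRD schema is not shown on this page. The sketch only illustrates how a remote node might find requests addressed to it that have no status entry yet.

```go
package main

import "fmt"

// PartitionSpec is a hypothetical request entry written by the consumer node.
type PartitionSpec struct {
	Name      string
	Node      string // node that should create the partition
	SizeBytes int64
}

// PartitionStatus is a hypothetical result entry written by the owning node.
type PartitionStatus struct {
	Name  string
	Node  string
	State string
}

type VolumeSpec struct{ Partitions []PartitionSpec }
type VolumeStatus struct{ Partitions []PartitionStatus }

// Volume is a trimmed-down stand-in for the MeshStorVolume CR.
type Volume struct {
	Name   string
	Spec   VolumeSpec
	Status VolumeStatus
}

// pendingFor returns the requests addressed to node that have no matching
// status entry yet, i.e. the work this node's reconciler still has to do.
func pendingFor(node string, v Volume) []PartitionSpec {
	done := make(map[string]bool, len(v.Status.Partitions))
	for _, s := range v.Status.Partitions {
		done[s.Name] = true
	}
	var pending []PartitionSpec
	for _, p := range v.Spec.Partitions {
		if p.Node == node && !done[p.Name] {
			pending = append(pending, p)
		}
	}
	return pending
}

// markExported records the node's result in its slice of the status.
func markExported(v *Volume, p PartitionSpec) {
	v.Status.Partitions = append(v.Status.Partitions,
		PartitionStatus{Name: p.Name, Node: p.Node, State: "Exported"})
}

func main() {
	v := Volume{Name: "pvc-example", Spec: VolumeSpec{Partitions: []PartitionSpec{
		{Name: "p0", Node: "node1", SizeBytes: 1 << 30},
		{Name: "p1", Node: "node2", SizeBytes: 1 << 30},
	}}}
	for _, p := range pendingFor("node2", v) {
		markExported(&v, p)
	}
	fmt.Println(len(pendingFor("node2", v)), len(pendingFor("node1", v)))
}
```

Since each node only acts on entries addressed to it and only appends its own results, repeated passes over the same CR converge instead of duplicating work.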

kubectl get msvol

NAME               PHASE    MDSTATE   READY   DEGRADED   SYNC   NODE       AGE
pvc-cd1038a7-...   Synced   active    2/2     0                 mf-01-02   5m

MeshStorNodeDevice (`msnd`)¶

Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.

Node plugins read every other node's MeshStorNodeDevice CRs when deciding where to place replica partitions. Free space and per-node RDMA/TCP reachability surfaced here feed directly into the node scoring algorithm — the consumer node uses that information to pick the best remote candidates and record the placement in the MeshStorVolume CR described above. Without these CRs, a node plugin would have no visibility into the rest of the cluster's drives and scoring would degenerate to round-robin.
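
The actual scoring algorithm is not specified here, so the Go sketch below is only a plausible illustration of how free space and reachability could feed a ranking. The Candidate type, pickReplicas, and the rule "filter by capacity and reachability, prefer RDMA, then prefer more free space" are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical per-node summary derived from msnd CRs.
type Candidate struct {
	Node      string
	FreeBytes int64
	RDMA      bool // reachable over RDMA
	TCP       bool // reachable over TCP
}

// pickReplicas returns up to n node names that can hold needBytes, ranked
// by the assumed rule: RDMA-reachable first, then most free space.
func pickReplicas(cands []Candidate, needBytes int64, n int) []string {
	if n < 0 {
		n = 0
	}
	var eligible []Candidate
	for _, c := range cands {
		if c.FreeBytes >= needBytes && (c.RDMA || c.TCP) {
			eligible = append(eligible, c)
		}
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		if eligible[i].RDMA != eligible[j].RDMA {
			return eligible[i].RDMA
		}
		return eligible[i].FreeBytes > eligible[j].FreeBytes
	})
	if len(eligible) > n {
		eligible = eligible[:n]
	}
	names := make([]string, 0, len(eligible))
	for _, c := range eligible {
		names = append(names, c.Node)
	}
	return names
}

func main() {
	cands := []Candidate{
		{Node: "a", FreeBytes: 100, TCP: true},
		{Node: "b", FreeBytes: 50, RDMA: true, TCP: true},
		{Node: "c", FreeBytes: 500}, // unreachable, never chosen
		{Node: "d", FreeBytes: 200, TCP: true},
	}
	fmt.Println(pickReplicas(cands, 40, 2))
}
```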

kubectl get msnd

NAME            MODEL                   SERIAL   LOCALPARTITIONS   REMOTEPARTITIONS   UNKNOWNPARTITIONS   MULTIQUEUE   BIGGESTUSABLEFREESPACE   SIZE      SECTORSIZE   UPDATEDAT
node1-nvme0n1   WD_BLACK SN7100 1TB     244...   1                 1                  0                   20           931.5Gi                  931.5Gi   4096         5s
node2-nvme1n1   WD_BLACK SN7100 500GB   260...   2                 0                  0                   16           465.8Gi                  465.8Gi   512          5s

Privilege Requirements¶

The node plugin runs as a privileged container. Privileged mode is required for:

GPT partition creation and removal (/dev/ access)
NVMe-oF target configuration (/sys/kernel/config/nvmet)
MD RAID management (mdadm)
Device probing (udevadm, partprobe)
Mount operations (bind mounts for pod volumes)

The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.

What's Next¶

Replication — how MeshStor uses MD RAID across nodes
Prerequisites — hardware and kernel requirements