Internals¶
This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.
MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.
Components¶
```mermaid
flowchart TB
    subgraph "Kubernetes Cluster"
        subgraph "Controller (StatefulSet, 1 replica)"
            C[meshstor-csi-controller]
            CP[csi-provisioner]
            CR[csi-resizer]
        end
        subgraph "Node 1 (DaemonSet)"
            N1[meshstor-csi-node]
            NR1[node-driver-registrar]
            D1[("NVMe Drive\nnvme0n1")]
        end
        subgraph "Node 2 (DaemonSet)"
            N2[meshstor-csi-node]
            NR2[node-driver-registrar]
            D2[("NVMe Drive\nnvme1n1")]
        end
        CRD1[("MeshStorVolume CR")]
        CRD2[("MeshStorNodeDevice CR")]
        C --> CRD1
        N1 --> CRD1
        N2 --> CRD1
        N1 --> CRD2
        N2 --> CRD2
        N1 -- "NVMe-oF" --- N2
        N1 --> D1
        N2 --> D2
    end
```
Single Binary, Two Roles¶
The same meshstor-csi binary serves both roles. Role detection is automatic:
- If `/sys/kernel/config/nvmet` exists (NVMe-oF target configfs) → node role
- Otherwise → controller role
This can be overridden with the --role flag, but auto-detection is the standard deployment.
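The detection logic can be sketched in a few lines. This is a minimal illustration of the rule above, not the driver's actual code; the function name and signature are assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// detectRole applies the auto-detection rule: if the NVMe-oF target
// configfs directory exists, this host can export partitions, so the
// binary runs as a node plugin; otherwise it runs as the controller.
func detectRole(nvmetConfigfs string) string {
	if info, err := os.Stat(nvmetConfigfs); err == nil && info.IsDir() {
		return "node"
	}
	return "controller"
}

func main() {
	// On a host without the nvmet configfs mounted this reports "controller".
	fmt.Println(detectRole("/sys/kernel/config/nvmet"))
}
```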
Controller¶
Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.
Responsibilities:
- CreateVolume / DeleteVolume — creates and deletes `MeshStorVolume` custom resources
- Node scoring — selects which nodes should host partitions based on available capacity, network latency, and RDMA support
- Capacity reporting — aggregates free space from `MeshStorNodeDevice` CRs
- Volume expansion — updates capacity in the volume CR and sets the phase to `Expanding`
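Node scoring takes the three inputs named above. The sketch below shows one plausible shape for such a ranking; the struct, the weights, and the RDMA bonus are all illustrative assumptions, since the actual weighting is not documented here:

```go
package main

import (
	"fmt"
	"sort"
)

// nodeStats holds the scoring inputs: free capacity (from the node's
// MeshStorNodeDevice CRs), measured network latency, and RDMA support.
type nodeStats struct {
	Name         string
	FreeBytes    int64
	LatencyUS    int64
	SupportsRDMA bool
}

// scoreNodes ranks candidates: more free capacity scores higher, higher
// latency scores lower, and RDMA-capable nodes get a fixed bonus.
// The weights here are arbitrary example values.
func scoreNodes(nodes []nodeStats) []string {
	score := func(n nodeStats) int64 {
		s := n.FreeBytes/(1<<20) - n.LatencyUS*10
		if n.SupportsRDMA {
			s += 1000
		}
		return s
	}
	sort.SliceStable(nodes, func(i, j int) bool {
		return score(nodes[i]) > score(nodes[j])
	})
	out := make([]string, len(nodes))
	for i, n := range nodes {
		out[i] = n.Name
	}
	return out
}

func main() {
	ranked := scoreNodes([]nodeStats{
		{"node1", 800 << 30, 120, false},
		{"node2", 950 << 30, 80, true},
	})
	fmt.Println(ranked) // node2 wins: more free space, lower latency, RDMA
}
```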
Sidecars:
- `csi-provisioner` — watches PVCs and calls CreateVolume/DeleteVolume
- `csi-resizer` — watches PVC resize requests and calls ControllerExpandVolume
Node¶
Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.
Responsibilities:
- Partition management — creates and removes GPT partitions on local NVMe drives
- NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
- NVMe-oF import — connects to remote nodes and imports their partitions
- MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
- XFS formatting — formats new arrays and grows the filesystem during expansion
- Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
- Device inventory — maintains `MeshStorNodeDevice` CRs with per-drive free space and volume counts
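The MD RAID assembly step ultimately shells out to `mdadm`. As a rough sketch of what assembling a RAID1 array from one local and one imported partition involves, the helper below builds an `mdadm --create` invocation; the function, the array name, and the device paths are examples, not the driver's real command construction:

```go
package main

import (
	"fmt"
	"strings"
)

// buildMdadmCreateArgs assembles the argument list for creating an MD
// array from the given member partitions (local and NVMe-oF imported
// devices look identical at this layer: both are block devices).
func buildMdadmCreateArgs(md, level string, members []string) []string {
	args := []string{
		"--create", md,
		"--level=" + level,
		fmt.Sprintf("--raid-devices=%d", len(members)),
		"--metadata=1.2",
	}
	return append(args, members...)
}

func main() {
	// Example: RAID1 from a local partition and a remote one imported
	// over NVMe-oF (which appears as an ordinary /dev/nvmeXnYpZ device).
	args := buildMdadmCreateArgs("/dev/md/example-volume", "1",
		[]string{"/dev/nvme0n1p3", "/dev/nvme1n1p2"})
	fmt.Println("mdadm " + strings.Join(args, " "))
}
```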
Sidecars:
- `node-driver-registrar` — registers the CSI driver with the kubelet
- `liveness-probe` — monitors driver health
NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form `nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node>` — for example `nqn.2025-12.io.meshstor:tcp:node1:node2` for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in `nvme list-subsys` output or under `/sys/kernel/config/nvmet/subsystems/` when debugging.
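Because the NQN is a pure function of transport and node identities, it can be derived anywhere without cluster state. A minimal sketch (the helper name is illustrative; the format string is the one documented above):

```go
package main

import "fmt"

// exportNQN builds the deterministic subsystem NQN for a partition on
// sourceNode exported to targetNode over the given transport
// ("tcp" or "rdma").
func exportNQN(transport, sourceNode, targetNode string) string {
	return fmt.Sprintf("nqn.2025-12.io.meshstor:%s:%s:%s",
		transport, sourceNode, targetNode)
}

func main() {
	fmt.Println(exportNQN("tcp", "node1", "node2"))
	// → nqn.2025-12.io.meshstor:tcp:node1:node2
}
```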
Reconciliation Loop¶
The node plugin runs a reconciliation loop triggered by the liveness probe every 10 seconds. Each cycle is idempotent — it can be interrupted and restarted safely.
Processing order:
- Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
- Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
- Export partitions — configure NVMe-oF target subsystems and namespaces
- Process deleting partitions — remove partitions marked for deletion
- Add imported partitions — reconnect remote partitions to their MD arrays
- Process member replacements — swap faulty or missing members with new partitions
- Process volume expansion — grow partitions and MD arrays for expanding volumes
- Update device stats — refresh `MeshStorNodeDevice` CRs with current free space
Custom Resources¶
MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators do not create them manually.
MeshStorVolume (msvol)¶
Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.
See CRD Specification for the complete field reference.
MeshStorNodeDevice (msnd)¶
Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.
```
NAME            NODE    DEVICE    MODEL             SIZE    FREE      VOLS   AGE
node1-nvme0n1   node1   nvme0n1   Samsung 990 PRO   1.0TB   800.0GB   2      1h
node2-nvme0n1   node2   nvme0n1   Samsung 990 PRO   1.0TB   950.0GB   1      1h
```
Privilege Requirements¶
Both controller and node pods run as privileged containers with SYS_ADMIN capability. The node plugin requires this for:
- GPT partition creation and removal (`/dev` access)
- NVMe-oF target configuration (`/sys/kernel/config/nvmet`)
- MD RAID management (`mdadm`)
- Device probing (`udevadm`, `partprobe`)
- Mount operations (bind mounts for pod volumes)
The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.
What's Next¶
- Replication — how MeshStor uses MD RAID across nodes
- Prerequisites — hardware and kernel requirements