Internals

This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.

MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.

Components

flowchart TB
    subgraph "Kubernetes Cluster"
        subgraph "Controller (StatefulSet, 1 replica)"
            C[meshstor-csi-controller]
            CP[csi-provisioner]
            CR[csi-resizer]
        end

        subgraph "Node 1 (DaemonSet)"
            N1[meshstor-csi-node]
            NR1[node-driver-registrar]
            D1[("NVMe Drive\nnvme0n1")]
        end

        subgraph "Node 2 (DaemonSet)"
            N2[meshstor-csi-node]
            NR2[node-driver-registrar]
            D2[("NVMe Drive\nnvme1n1")]
        end

        CRD1[("MeshStorVolume CR")]
        CRD2[("MeshStorNodeDevice CR")]

        C --> CRD1
        N1 --> CRD1
        N2 --> CRD1
        N1 --> CRD2
        N2 --> CRD2
        N1 -- "NVMe-oF" --- N2
        N1 --> D1
        N2 --> D2
    end

Single Binary, Two Roles

The same meshstor-csi binary serves both roles. Role detection is automatic:

If /sys/kernel/config/nvmet exists (NVMe-oF target configfs) → node role
Otherwise → controller role

The --role flag overrides the detected role, but standard deployments rely on auto-detection.
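
The detection rule above can be sketched in a few lines of Go. This is an illustration, not the driver's source: the function name `detectRole` and the convention that an empty string means "no --role override" are assumptions, and only the configfs path comes from this page.

```go
package main

import (
	"fmt"
	"os"
)

// nvmetConfigfs is the path this page names as the auto-detection signal.
const nvmetConfigfs = "/sys/kernel/config/nvmet"

// detectRole returns the role for this process. A non-empty override
// (standing in for the --role flag) wins. Otherwise the role is "node"
// when configfsPath can be stat'ed and "controller" when it cannot.
// Treating every stat error as "absent" is an assumption of this sketch.
func detectRole(override, configfsPath string) string {
	if override != "" {
		return override
	}
	if _, err := os.Stat(configfsPath); err == nil {
		return "node"
	}
	return "controller"
}

func main() {
	fmt.Println("auto-detected role:", detectRole("", nvmetConfigfs))
	fmt.Println("forced role:", detectRole("controller", nvmetConfigfs))
}
```

Passing the path as a parameter keeps the check testable on machines that have no NVMe-oF target configfs mounted.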

Controller

Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.

Responsibilities:

CreateVolume / DeleteVolume — creates and deletes MeshStorVolume custom resources
Node scoring — selects which nodes should host partitions based on available capacity, network latency, and RDMA support
Capacity reporting — aggregates free space from MeshStorNodeDevice CRs
Volume expansion — updates capacity in the volume CR and sets the phase to Expanding

Sidecars:

csi-provisioner — watches PVCs and calls CreateVolume/DeleteVolume
csi-resizer — watches PVC resize requests and calls ControllerExpandVolume
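
To make the node scoring responsibility concrete, here is one way a controller could rank candidates on the three listed inputs (available capacity, network latency, RDMA support). The formula, the weights, and every name in it (`candidate`, `score`, `pickNodes`) are invented for illustration; this page does not document MeshStor's actual scoring function.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate holds the three inputs this page lists for node scoring.
// Field names and units are hypothetical.
type candidate struct {
	Name      string
	FreeBytes int64
	LatencyUs float64 // assumed: round-trip latency to the consuming node
	RDMA      bool
}

// score is an illustrative formula, not MeshStor's: free capacity in GiB,
// minus a latency penalty, plus a fixed bonus for RDMA. All weights are made up.
func score(c candidate) float64 {
	s := float64(c.FreeBytes)/(1<<30) - 0.1*c.LatencyUs
	if c.RDMA {
		s += 50
	}
	return s
}

// pickNodes drops candidates that cannot fit needBytes, then returns the
// names of the n highest-scoring nodes (fewer if not enough qualify).
func pickNodes(cands []candidate, needBytes int64, n int) []string {
	var fit []candidate
	for _, c := range cands {
		if c.FreeBytes >= needBytes {
			fit = append(fit, c)
		}
	}
	sort.SliceStable(fit, func(i, j int) bool { return score(fit[i]) > score(fit[j]) })
	if len(fit) > n {
		fit = fit[:n]
	}
	names := make([]string, 0, len(fit))
	for _, c := range fit {
		names = append(names, c.Name)
	}
	return names
}

func main() {
	cands := []candidate{
		{Name: "node1", FreeBytes: 800 << 30, LatencyUs: 120},
		{Name: "node2", FreeBytes: 950 << 30, LatencyUs: 150},
		{Name: "node3", FreeBytes: 50 << 30, LatencyUs: 20, RDMA: true},
	}
	// Ask for two replicas of a 100 GiB volume; node3 is too small to qualify.
	fmt.Println(pickNodes(cands, 100<<30, 2))
}
```

Filtering on capacity before ranking keeps a fast but nearly full node from being selected for a partition it cannot hold.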

Node

Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.

Responsibilities:

Partition management — creates and removes GPT partitions on local NVMe drives
NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
NVMe-oF import — connects to remote nodes and imports their partitions
MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
XFS formatting — formats new arrays and grows the filesystem during expansion
Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
Device inventory — maintains MeshStorNodeDevice CRs with per-drive free space and volume counts

Sidecars:

node-driver-registrar — registers the CSI driver with the kubelet
liveness-probe — monitors driver health

NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node> — for example nqn.2025-12.io.meshstor:tcp:node1:node2 for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in nvme list-subsys output or under /sys/kernel/config/nvmet/subsystems/ when debugging.
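
Because the NQN is a pure function of transport and node names, it can be built and parsed with plain string handling. The sketch below follows the format given above; the helper names `buildNQN` and `parseNQN` are hypothetical, and the parser assumes node names contain no colons.

```go
package main

import (
	"fmt"
	"strings"
)

// nqnPrefix is the fixed leading part of the documented NQN format.
const nqnPrefix = "nqn.2025-12.io.meshstor"

// buildNQN assembles <prefix>:<transport>:<source-node>:<target-node>.
func buildNQN(transport, sourceNode, targetNode string) string {
	return strings.Join([]string{nqnPrefix, transport, sourceNode, targetNode}, ":")
}

// parseNQN splits a MeshStor NQN back into its fields and rejects
// anything that does not have the prefix plus exactly three fields.
func parseNQN(nqn string) (transport, source, target string, err error) {
	parts := strings.Split(nqn, ":")
	if len(parts) != 4 || parts[0] != nqnPrefix {
		return "", "", "", fmt.Errorf("not a MeshStor NQN: %q", nqn)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	nqn := buildNQN("tcp", "node1", "node2")
	fmt.Println(nqn)

	transport, source, target, err := parseNQN(nqn)
	fmt.Println(transport, source, target, err)
}
```

A parser like this is handy when filtering subsystem listings down to the ones that involve a particular source or target node.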

Reconciliation Loop

The node plugin runs a reconciliation loop triggered by the liveness probe every 10 seconds. Each cycle is idempotent — it can be interrupted and restarted safely.

Processing order:

Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
Export partitions — configure NVMe-oF target subsystems and namespaces
Process deleting partitions — remove partitions marked for deletion
Add imported partitions — reconnect remote partitions to their MD arrays
Process member replacements — swap faulty or missing members with new partitions
Process volume expansion — grow partitions and MD arrays for expanding volumes
Update device stats — refresh MeshStorNodeDevice CRs with current free space
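
The ordering and idempotency described above suggest a simple structure: a fixed list of steps run top to bottom, with the next cycle starting again from the first step. The following is a minimal sketch under that reading. Stopping at the first failed step is an assumption, the step bodies are stubs, and the names `step`, `reconcile`, and `errTransient` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient simulates a step that fails once, such as an unreachable peer.
var errTransient = errors.New("transient failure")

// step is one named phase of a reconcile cycle.
type step struct {
	name string
	run  func() error
}

// reconcile runs the steps in order and stops at the first failure,
// returning the names of the steps that completed. Because every step is
// expected to be idempotent, the next cycle can rerun all of them safely.
func reconcile(steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	failOnce := true
	steps := []step{
		{"stop foreign MD devices", ok},
		{"collect partitions for export", ok},
		{"export partitions", func() error {
			if failOnce {
				failOnce = false
				return errTransient
			}
			return nil
		}},
		{"process deleting partitions", ok},
		{"add imported partitions", ok},
		{"process member replacements", ok},
		{"process volume expansion", ok},
		{"update device stats", ok},
	}
	// Two cycles: the first is cut short, the second runs to completion.
	for cycle := 1; cycle <= 2; cycle++ {
		done, err := reconcile(steps)
		fmt.Printf("cycle %d: %d steps done, err=%v\n", cycle, len(done), err)
	}
}
```

The second cycle reruns the two steps that already succeeded, which is only safe because each step converges to the same state no matter how often it runs.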

Custom Resources

MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators do not create them manually.

MeshStorVolume (`msvol`)

Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.

kubectl get msvol

NAME                  PHASE    MDSTATE   TOTAL   ACTIVE   FAILED   DOWN   SYNC   AGE
pvc-a1b2c3d4-...      Synced   active    2       2        0        0             5m

See CRD Specification for the complete field reference.

MeshStorNodeDevice (`msnd`)

Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.

kubectl get msnd

NAME                    NODE        DEVICE    MODEL              SIZE        FREE        VOLS   AGE
node1-nvme0n1           node1       nvme0n1   Samsung 990 PRO    1.0TB       800.0GB     2      1h
node2-nvme0n1           node2       nvme0n1   Samsung 990 PRO    1.0TB       950.0GB     1      1h
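
The per-drive FREE values are what the controller's capacity reporting aggregates. A minimal sketch of that aggregation, using the two rows above as input, might look like this. The struct is a hypothetical mirror of the printed columns rather than the real CRD schema, and reading "GB" as 10^9 bytes is an assumption.

```go
package main

import "fmt"

// gb is 10^9 bytes; whether the kubectl columns use decimal units is an assumption.
const gb = 1_000_000_000

// nodeDevice mirrors a few columns printed by `kubectl get msnd`.
// Field names are hypothetical; see the CRD Specification for the real schema.
type nodeDevice struct {
	Node      string
	Device    string
	FreeBytes int64
	Volumes   int
}

// freeByNode sums free bytes per node across all of its drives.
func freeByNode(devs []nodeDevice) map[string]int64 {
	out := make(map[string]int64)
	for _, d := range devs {
		out[d.Node] += d.FreeBytes
	}
	return out
}

// totalFree sums free bytes across every device in the cluster.
func totalFree(devs []nodeDevice) int64 {
	var sum int64
	for _, d := range devs {
		sum += d.FreeBytes
	}
	return sum
}

func main() {
	devs := []nodeDevice{
		{Node: "node1", Device: "nvme0n1", FreeBytes: 800 * gb, Volumes: 2},
		{Node: "node2", Device: "nvme0n1", FreeBytes: 950 * gb, Volumes: 1},
	}
	fmt.Println("free per node:", freeByNode(devs))
	fmt.Println("cluster free (GB):", totalFree(devs)/gb)
}
```

Keeping the per-node totals separate from the cluster total matters for replicated volumes, since each replica has to fit on a single node.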

Privilege Requirements

Both controller and node pods run as privileged containers with the SYS_ADMIN capability. The node plugin needs this level of access for:

GPT partition creation and removal (/dev/ access)
NVMe-oF target configuration (/sys/kernel/config/nvmet)
MD RAID management (mdadm)
Device probing (udevadm, partprobe)
Mount operations (bind mounts for pod volumes)

The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.

What's Next

Replication — how MeshStor uses MD RAID across nodes
Prerequisites — hardware and kernel requirements