Internals

This page describes the components that maintain the MeshStor data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources. For the layered data path itself (how a write travels from a pod to a sector on disk), see Architecture.

MeshStor CSI runs as a single Go binary deployed in two roles within your Kubernetes cluster. There are no external storage daemons, databases, or control planes — everything runs inside standard Kubernetes workloads.

Components

flowchart TB
    subgraph "Kubernetes Cluster"
        subgraph "Controller (StatefulSet, 1 replica)"
            C[meshstor-csi-controller]
            CP[csi-provisioner]
            CR[csi-resizer]
        end

        subgraph "Node 1 (DaemonSet)"
            N1[meshstor-csi-node]
            NR1[node-driver-registrar]
            D1[("NVMe Drive\nnvme0n1")]
        end

        subgraph "Node 2 (DaemonSet)"
            N2[meshstor-csi-node]
            NR2[node-driver-registrar]
            D2[("NVMe Drive\nnvme1n1")]
        end

        CRD1[("MeshStorVolume CR")]
        CRD2[("MeshStorNodeDevice CR")]

        C --> CRD1
        N1 --> CRD1
        N2 --> CRD1
        N1 --> CRD2
        N2 --> CRD2
        N1 -- "NVMe-oF" --- N2
        N1 --> D1
        N2 --> D2
    end

Single Binary, Two Roles

The same meshstor-csi binary serves both roles. Role detection is automatic:

  • If /sys/kernel/config/nvmet exists (NVMe-oF target configfs) → node role
  • Otherwise → controller role

Auto-detection can be overridden with the --role flag, but standard deployments rely on it.
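
As an illustration of the rule above, the following is a minimal sketch of role detection, not the driver's actual code. The `Role` type, the `DetectRole` function, and its `override` parameter (standing in for the --role flag) are hypothetical names. The check for a directory, rather than any file at that path, is also an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// Role is the mode the binary runs in (hypothetical type for this sketch).
type Role string

const (
	RoleNode       Role = "node"
	RoleController Role = "controller"
)

// nvmetConfigfs is the NVMe-oF target configfs directory named in the text.
const nvmetConfigfs = "/sys/kernel/config/nvmet"

// DetectRole returns override when it is non-empty (a stand-in for the
// --role flag). Otherwise it returns RoleNode if configfsPath exists and
// is a directory, and RoleController in every other case.
func DetectRole(configfsPath string, override Role) Role {
	if override != "" {
		return override
	}
	info, err := os.Stat(configfsPath)
	if err == nil && info.IsDir() {
		return RoleNode
	}
	return RoleController
}

func main() {
	fmt.Println("detected role:", DetectRole(nvmetConfigfs, ""))
}
```

Passing the path as a parameter keeps the function testable on machines without the nvmet kernel module loaded.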

Controller

Deployed as a StatefulSet with 1 replica. The controller never touches data — it orchestrates volume lifecycle through Kubernetes CRs.

Responsibilities:

  • CreateVolume / DeleteVolume — creates and deletes MeshStorVolume custom resources
  • Node scoring — selects which nodes should host partitions based on available capacity, network latency, and RDMA support
  • Capacity reporting — aggregates free space from MeshStorNodeDevice CRs
  • Volume expansion — updates capacity in the volume CR and sets the phase to Expanding

Sidecars:

  • csi-provisioner — watches PVCs and calls CreateVolume/DeleteVolume
  • csi-resizer — watches PVC resize requests and calls ControllerExpandVolume
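
The node scoring responsibility listed above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the weights in `score`, and the `PickNodes` helper are hypothetical and do not reflect the controller's real scoring formula. Only the three inputs (available capacity, network latency, RDMA support) come from the text.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical per-node summary; the field names are
// illustrative and are not taken from the MeshStor CRDs.
type Candidate struct {
	Name      string
	FreeBytes int64
	LatencyUs float64 // round-trip latency in microseconds
	RDMA      bool
}

// score ranks a candidate. The weights are arbitrary assumptions.
func score(c Candidate) float64 {
	s := float64(c.FreeBytes) / float64(1<<30) // one point per free GiB
	s -= c.LatencyUs / 10                      // penalize latency
	if c.RDMA {
		s += 100 // prefer RDMA-capable nodes
	}
	return s
}

// PickNodes returns up to n node names that can fit sizeBytes,
// ordered from highest to lowest score.
func PickNodes(cands []Candidate, sizeBytes int64, n int) []string {
	var fit []Candidate
	for _, c := range cands {
		if c.FreeBytes >= sizeBytes {
			fit = append(fit, c)
		}
	}
	sort.SliceStable(fit, func(i, j int) bool {
		return score(fit[i]) > score(fit[j])
	})
	if len(fit) > n {
		fit = fit[:n]
	}
	names := make([]string, 0, len(fit))
	for _, c := range fit {
		names = append(names, c.Name)
	}
	return names
}

func main() {
	cands := []Candidate{
		{Name: "node1", FreeBytes: 800 << 30, LatencyUs: 50},
		{Name: "node2", FreeBytes: 950 << 30, LatencyUs: 40, RDMA: true},
		{Name: "node3", FreeBytes: 10 << 30, LatencyUs: 20, RDMA: true},
	}
	// Pick two hosts for a 100 GiB volume.
	fmt.Println(PickNodes(cands, 100<<30, 2))
}
```

Filtering on capacity before ranking means a node that cannot fit the volume is never chosen, however well it scores on latency or RDMA.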

Node

Deployed as a DaemonSet on every node. The node plugin manages the physical storage stack.

Responsibilities:

  • Partition management — creates and removes GPT partitions on local NVMe drives
  • NVMe-oF export — configures the NVMe-oF target subsystem to expose partitions to remote nodes
  • NVMe-oF import — connects to remote nodes and imports their partitions
  • MD RAID assembly — creates and manages MD RAID1/RAID10 arrays from local and remote partitions
  • XFS formatting — formats new arrays and grows the filesystem during expansion
  • Mount/unmount — stages volumes to the staging path and bind-mounts to the target path
  • Device inventory — maintains MeshStorNodeDevice CRs with per-drive free space and volume counts

Sidecars:

  • node-driver-registrar — registers the CSI driver with the kubelet
  • liveness-probe — monitors driver health

NVMe-oF subsystem naming. Exported partitions use deterministic NQNs in the form nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node> — for example nqn.2025-12.io.meshstor:tcp:node1:node2 for a partition on node1 exported to node2 over TCP. The format makes subsystems predictable from node identities and greppable in nvme list-subsys output or under /sys/kernel/config/nvmet/subsystems/ when debugging.
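
The NQN format above is simple enough to build and parse with string operations. The sketch below shows one way to do so; `SubsystemNQN` and `ParseSubsystemNQN` are hypothetical helper names, and the parser assumes that transport and node names never contain a colon.

```go
package main

import (
	"fmt"
	"strings"
)

// nqnPrefix is the fixed prefix of the NQN format given in the text.
const nqnPrefix = "nqn.2025-12.io.meshstor"

// SubsystemNQN builds
// nqn.2025-12.io.meshstor:<transport>:<source-node>:<target-node>.
func SubsystemNQN(transport, source, target string) string {
	return fmt.Sprintf("%s:%s:%s:%s", nqnPrefix, transport, source, target)
}

// ParseSubsystemNQN splits an NQN back into its parts. It assumes that
// transport and node names contain no colons.
func ParseSubsystemNQN(nqn string) (transport, source, target string, err error) {
	parts := strings.Split(nqn, ":")
	if len(parts) != 4 || parts[0] != nqnPrefix {
		return "", "", "", fmt.Errorf("not a MeshStor subsystem NQN: %q", nqn)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	nqn := SubsystemNQN("tcp", "node1", "node2")
	fmt.Println(nqn)

	transport, source, target, err := ParseSubsystemNQN(nqn)
	fmt.Println(transport, source, target, err)
}
```

A parser like this is handy when filtering nvme list-subsys output for the subsystems that belong to a given node pair.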

Reconciliation Loop

The node plugin runs a reconciliation loop triggered by the liveness probe every 10 seconds. Each cycle is idempotent — it can be interrupted and restarted safely.

Processing order:

  1. Stop foreign MD devices — stop MD arrays for volumes not assigned to this node
  2. Collect partitions for export — create requested partitions and prepare them for NVMe-oF export
  3. Export partitions — configure NVMe-oF target subsystems and namespaces
  4. Process deleting partitions — remove partitions marked for deletion
  5. Add imported partitions — reconnect remote partitions to their MD arrays
  6. Process member replacements — swap faulty or missing members with new partitions
  7. Process volume expansion — grow partitions and MD arrays for expanding volumes
  8. Update device stats — refresh MeshStorNodeDevice CRs with current free space
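
The ordered cycle above can be sketched as a list of named steps run in sequence. This is an illustrative outline only: the `step` type, `RunCycle`, and `errDemo` are hypothetical, and the choice to continue with later steps after one fails is an assumption, since the text does not say how the driver handles a failing step.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one named phase of a reconcile cycle (hypothetical type).
type step struct {
	name string
	run  func() error
}

// errDemo is a placeholder failure used only for illustration.
var errDemo = errors.New("device busy")

// RunCycle runs the steps in order. A failing step is recorded and the
// cycle continues; this policy is an assumption of the sketch. Because
// each step is expected to be idempotent, the next cycle can retry it.
func RunCycle(steps []step) (ran []string, errs []error) {
	for _, s := range steps {
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
			continue
		}
		ran = append(ran, s.name)
	}
	return ran, errs
}

func main() {
	ok := func() error { return nil }
	busy := func() error { return errDemo }

	steps := []step{
		{"stop foreign MD devices", ok},
		{"collect partitions for export", ok},
		{"export partitions", ok},
		{"process deleting partitions", busy}, // simulated failure
		{"add imported partitions", ok},
		{"process member replacements", ok},
		{"process volume expansion", ok},
		{"update device stats", ok},
	}

	ran, errs := RunCycle(steps)
	fmt.Println(len(ran), "steps completed,", len(errs), "failed")
	for _, err := range errs {
		fmt.Println("  retry next cycle:", err)
	}
}
```

Keeping the steps in a slice makes the processing order explicit and easy to audit against the numbered list.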

Custom Resources

MeshStor uses two cluster-scoped CRDs to track state. Both are managed entirely by the driver — operators do not create them manually.

MeshStorVolume (msvol)

Tracks the desired and actual state of each volume. Created by the controller on CreateVolume, updated by both controller and node plugins.

kubectl get msvol
NAME                  PHASE    MDSTATE   TOTAL   ACTIVE   FAILED   DOWN   SYNC   AGE
pvc-a1b2c3d4-...     Synced   active    2       2        0        0             5m

See CRD Specification for the complete field reference.

MeshStorNodeDevice (msnd)

Tracks per-node block device inventory. Created and updated by each node plugin during reconciliation.

kubectl get msnd
NAME                    NODE        DEVICE    MODEL              SIZE        FREE        VOLS   AGE
node1-nvme0n1           node1       nvme0n1   Samsung 990 PRO    1.0TB       800.0GB     2      1h
node2-nvme0n1           node2       nvme0n1   Samsung 990 PRO    1.0TB       950.0GB     1      1h

Privilege Requirements

Both controller and node pods run as privileged containers with SYS_ADMIN capability. The node plugin requires this for:

  • GPT partition creation and removal (/dev/ access)
  • NVMe-oF target configuration (/sys/kernel/config/nvmet)
  • MD RAID management (mdadm)
  • Device probing (udevadm, partprobe)
  • Mount operations (bind mounts for pod volumes)

The node DaemonSet uses hostNetwork: true to enable direct NVMe-oF TCP/RDMA communication between nodes.

What's Next