Architecture¶
This page describes how a write from a pod travels through MeshStor to a sector on disk, layer by layer. For the components that maintain that data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources — see Internals.
MeshStor delivers replicated block storage using only kernel subsystems: GPT partitions on local NVMe drives, NVMe-oF for network transport, and MD RAID for replication. Every byte travels from pod to disk through kernel code paths — no userspace proxies, no protocol translation, no custom replication engines.
The Data Path¶
flowchart TB
POD["Pod"]
XFS["XFS Filesystem"]
MD["MD RAID1<br/>/dev/md0"]
LP["Local Partition<br/>nvme0n1p3"]
RP["Remote Partition<br/>nvme2n1"]
LD[("Local NVMe Drive")]
RD[("Remote NVMe Drive")]
POD -->|"POSIX read/write"| XFS
XFS -->|"block I/O"| MD
MD -->|"mirror"| LP
MD -->|"mirror"| RP
LP --> LD
RP -->|"NVMe-oF TCP/RDMA"| RD
Reads come from the local partition — sub-millisecond latency, identical to a non-replicated local volume. Writes go to both mirrors in parallel; the write completes when both acknowledge. The MD write-intent bitmap tracks which regions are dirty, so only changed blocks need resync after an interruption — not the entire volume.
Write amplification¶
A write to a numberOfCopies=N MeshStor volume becomes N NVMe-oF writes — one to each copy. The arithmetic behind that mental model:
| Property | Formula | Notes |
|---|---|---|
| Write IOPS to backing devices | N × volume IOPS | Each replicated write hits N partitions |
| Write throughput required from each backing device | volume throughput | Per copy, not aggregated |
| Write latency observed by the pod | max(local, remote_1, …, remote_{N-1}) | Not a sum — copies are written in parallel |
| Read latency observed by the pod | local latency | Reads are served from the local partition |
| Network throughput on each remote link | volume write throughput | One inbound stream per copy hosted on that node |
For numberOfCopies=1, all of the multipliers collapse to 1 — the volume goes through the same data path as a replicated volume but only writes to one underlying partition. There is no remote NVMe-oF traffic in steady state.
For RAID10 (drivesPerCopy ≥ 2), each copy is itself striped across multiple drives, so the local IOPS multiplier becomes N × drivesPerCopy against N × drivesPerCopy underlying partitions. See Replication for the RAID-level detail.
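The table and the RAID10 rule collapse into a few lines of arithmetic. The following is a standalone model for sizing, not MeshStor code; the function name and return keys are illustrative:

```python
def write_amplification(volume_iops, volume_tput_mib, copy_latencies_ms,
                        number_of_copies=2, drives_per_copy=1):
    """Model the backing-device load for one MeshStor volume.

    copy_latencies_ms: per-copy write latency in ms, local copy first.
    """
    assert len(copy_latencies_ms) == number_of_copies
    partitions = number_of_copies * drives_per_copy
    return {
        # Each replicated write hits every copy; striping raises the multiplier.
        "backing_write_iops": volume_iops * partitions,
        # Each backing device must sustain the full volume throughput (per copy).
        "per_device_tput_mib": volume_tput_mib,
        # Copies are written in parallel: pod latency is the slowest mirror.
        "pod_write_latency_ms": max(copy_latencies_ms),
        # Reads are served from the local partition.
        "pod_read_latency_ms": copy_latencies_ms[0],
        "underlying_partitions": partitions,
    }
```

For numberOfCopies=1 every multiplier evaluates to 1, matching the steady-state claim above.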
Failure domains¶
Each layer of the data path protects against a specific failure class. The table below shows which layer catches what.
| Failure | Protected by | Surviving copies |
|---|---|---|
| Bad disk sector | XFS metadata + MD scrub | All — MD rewrites the bad sector from a good copy |
| Whole drive failure | MD RAID1 / RAID10 (degraded operation) | N − drives_lost; the volume continues serving reads and writes |
| Node failure | NVMe-oF host disconnect → MD degraded mode | N − 1; the local copy on the failed node is gone, the remote copies remain |
| Network partition | NVMe-oF host disconnect → MD degraded mode | N − copies_on_partitioned_side |
| Simultaneous loss of all copies | Nothing | Data loss |
The number of copies (numberOfCopies) is the dominant variable. Two copies survive any single failure; three copies survive any two simultaneous failures. One copy (the numberOfCopies=1 mode) survives no failure of the underlying storage but does survive a soft eviction of the pod because MeshStor relocates the partition to another node — see the local-storage section of Comparison.
For the timeline of how MD detects and responds to each failure class, see Self-Healing.
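The survivability rule reduces to simple subtraction. This sketch assumes each failed node or drive takes out at most one copy, which the one-partition-per-node placement guarantees:

```python
def surviving_copies(number_of_copies, simultaneous_failures):
    """Copies left after losing nodes/drives, one copy per failure."""
    return max(number_of_copies - simultaneous_failures, 0)

def volume_available(number_of_copies, simultaneous_failures):
    # The volume keeps serving I/O as long as at least one copy survives.
    return surviving_copies(number_of_copies, simultaneous_failures) >= 1
```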
Storage Layer: GPT Partitions¶
Volumes are real GPT partitions on physical NVMe drives — not files, not loopback devices, not thin-provisioned images. Each volume gets a dedicated partition with a deterministic UUID derived from the volume name. Multiple volumes share a drive through the GPT partition table.
This means:
- Direct block device I/O — no filesystem-on-filesystem overhead, no copy-on-write tax
- Partition alignment to 1 MiB boundaries for optimal NVMe write performance
- Clean tenancy — new partitions are zeroed and superblocks cleared to prevent stale metadata from previous users
- XFS formatted with a per-volume UUID for consistent identification across reboots and node migrations
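A deterministic, name-derived identifier can be sketched with RFC 4122 name-based (version 5) UUIDs. The namespace string here is a hypothetical placeholder — MeshStor's actual derivation scheme may differ:

```python
import uuid

# Hypothetical namespace; MeshStor's real derivation may use a different one.
MESHSTOR_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "meshstor.example.com")

def partition_uuid(volume_name: str) -> uuid.UUID:
    """Same volume name -> same partition UUID, on every node, every time."""
    return uuid.uuid5(MESHSTOR_NS, volume_name)
```

Determinism is the point: a node can re-identify its partitions after a reboot or migration from the GPT table alone, without any local state.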
Transport Layer: NVMe-oF¶
Remote partitions appear as native NVMe block devices on the consuming node. The kernel's NVMe-oF initiator handles all data I/O — MeshStor only configures the connection at setup time.
| Transport | When Used | Latency | CPU Overhead |
|---|---|---|---|
| RDMA | Both nodes share an RDMA subnet | Lowest (~1 µs network) | Near zero (kernel bypass) |
| TCP | Any IP network | Low (~10 µs network) | Minimal (kernel TCP stack) |
Transport is selected automatically per node pair. When both nodes advertise RDMA addresses in the same subnet, RDMA is used. Otherwise, TCP provides a capable fallback that works over any IP network without special hardware.
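The selection rule fits in a few lines. The address format ("addr/prefix" strings, `None` when a node has no RDMA NIC) is an assumption for illustration, not MeshStor's wire format:

```python
import ipaddress

def select_transport(local_rdma, remote_rdma):
    """Pick the NVMe-oF transport for a node pair.

    local_rdma / remote_rdma: advertised RDMA address as 'addr/prefix',
    or None if the node advertises no RDMA address.
    """
    if local_rdma and remote_rdma:
        a = ipaddress.ip_interface(local_rdma)
        b = ipaddress.ip_interface(remote_rdma)
        if b.ip in a.network:       # same RDMA subnet -> kernel-bypass path
            return "rdma"
    return "tcp"                    # fallback that works on any IP network
```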
Connection parameters are tuned for fast failure detection:
| Parameter | Value | Purpose |
|---|---|---|
| Keep-alive | 1s | Detect hung controllers quickly |
| Fast I/O fail | 1s | Fail I/O to MD for fast failover |
| Controller loss | 3s | Tear down connection if unrecoverable |
| Reconnect delay | 1s | Retry connection immediately |
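Assembled into an nvme-cli invocation, the tuning would look roughly like the sketch below. The NQN and address are placeholders, and the flag spellings follow nvme-cli conventions but should be verified against your installed version:

```python
def nvme_connect_cmd(transport, addr, nqn):
    """Assemble an `nvme connect` argv with the failure-detection tuning above.

    Illustrative only: flag names may vary across nvme-cli versions.
    """
    return [
        "nvme", "connect",
        "--transport", transport,      # "tcp" or "rdma"
        "--traddr", addr,
        "--trsvcid", "4420",           # standard NVMe-oF port
        "--nqn", nqn,
        "--keep-alive-tmo", "1",       # detect hung controllers in 1 s
        "--reconnect-delay", "1",      # retry the connection immediately
        "--ctrl-loss-tmo", "3",        # tear down if unrecoverable after 3 s
    ]
```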
Address configuration (TCP and optional RDMA annotations, required ports, and the subnet connectivity rule) lives in Prerequisites.
Replication Layer: MD RAID¶
Linux MD RAID mirrors writes across local and remote partitions. This is the same subsystem that has protected production Linux servers for decades — battle-tested, well-understood, and maintained by the kernel community.
| Copies | Drives | RAID Level | Behavior |
|---|---|---|---|
| 1 | 1 | RAID1 (2 slots) | 1 active + 1 placeholder for relocation |
| 2 | 1 | RAID1 | Mirror across 2 nodes |
| 3 | 1 | RAID1 | Mirror across 3 nodes |
| 2 | 2 | RAID10 | Striped mirrors across 2 nodes, 2 drives each |
Key design choices:
- Write-intent bitmap — tracks dirty regions at block granularity. After a brief disconnection, only the changed blocks resync — not the entire volume. This turns a multi-hour rebuild into seconds or minutes.
- Assume-clean creation — new arrays skip the initial full sync because all members start empty. The first real data write is the only write.
- Local-first reads — MD preferentially reads from the local partition, keeping read latency identical to a non-replicated local NVMe volume.
- Interleaved layout — for RAID10, partitions from different nodes alternate across mirror groups. No single node failure can take out an entire stripe.
See Replication for StorageClass configuration, degraded operation, and recovery behavior.
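One way to realize the interleaved layout, assuming MD RAID10's default "near" layout where adjacent member slots form a mirror group (a sketch — MeshStor's actual slot ordering may differ):

```python
def interleave(copies):
    """Order member partitions for a RAID10 'near' layout.

    copies: one list of partitions per node, e.g.
            [["A-p0", "A-p1"], ["B-p0", "B-p1"]]
    Adjacent slots form mirror groups, so alternating nodes per slot
    guarantees every mirror group spans all nodes.
    """
    drives_per_copy = len(copies[0])
    slots = []
    for d in range(drives_per_copy):
        for node_partitions in copies:
            slots.append(node_partitions[d])
    return slots
```

With two nodes and two drives each, the result is `A-p0, B-p0, A-p1, B-p1`: mirror group 0 is (A-p0, B-p0) and group 1 is (A-p1, B-p1), so losing either node leaves one member of every group intact.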
Self-Healing¶
MeshStor continuously monitors volume health through a reconciliation loop on every node. Recovery is automatic — no operator intervention required.
Automatic reconnection — if an NVMe-oF connection drops and recovers (transient network blip, node reboot), the kernel re-establishes the connection and MD resyncs only the dirty bitmap regions. For brief interruptions, resync completes in seconds.
Member replacement — if a partition stays unreachable beyond the configurable timeout (default 15 minutes, minimum 60 seconds), the reconciler selects the best available node and provisions a replacement partition. The new member syncs from the surviving copy, and the old member is cleaned up automatically.
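The replacement decision reduces to a clamped timeout check. This is an illustrative reduction — the real reconciler tracks per-member state rather than a single duration:

```python
DEFAULT_TIMEOUT_S = 15 * 60   # default member-replacement timeout
MIN_TIMEOUT_S = 60            # enforced minimum

def should_replace(unreachable_for_s, configured_timeout_s=DEFAULT_TIMEOUT_S):
    """True once a member has been unreachable longer than the timeout."""
    timeout = max(configured_timeout_s, MIN_TIMEOUT_S)  # clamp to the floor
    return unreachable_for_s > timeout
```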
Drain migration — when a node is drained with kubectl drain, the volume migrates transparently to another node. The old partition is imported via NVMe-oF, a new local partition is created, MD syncs the data, and the old partition is removed. The pod sees no data loss and minimal interruption.
See Self-Healing for failure recovery and Volume Relocation for drain migration details.
Node Placement¶
When selecting which nodes host volume partitions, MeshStor scores every candidate node:
| Factor | Priority | Rationale |
|---|---|---|
| RDMA connectivity | Highest | Lower latency, lower CPU overhead on the data path |
| Available free space | High | Distributes volumes evenly, avoids capacity hotspots |
| Network latency | Medium | Prefers topologically closer nodes |
| Fault isolation | Enforced | One partition per node per volume — no single-node SPOF |
Scoring runs at volume creation and again when a failed member is replaced, so placement adapts to the current state of the cluster.
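A sketch of the scoring, with illustrative weights and field names (the actual weights are not documented here); fault isolation is modeled as a hard filter rather than a score term, matching its "Enforced" priority:

```python
def score_candidates(candidates, used_nodes):
    """Rank candidate nodes for a new or replacement partition.

    candidates: dicts with hypothetical fields
                name, has_rdma (bool), free_bytes, latency_ms.
    used_nodes: nodes already hosting a partition of this volume.
    """
    # Fault isolation is a hard constraint: one partition per node per volume.
    eligible = [c for c in candidates if c["name"] not in used_nodes]

    def score(c):
        return (
            (1000 if c["has_rdma"] else 0)   # highest: RDMA connectivity
            + c["free_bytes"] / 2**30        # high: GiB of free space
            - c["latency_ms"] * 10           # medium: prefer closer nodes
        )

    return sorted(eligible, key=score, reverse=True)
```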
What you'll find on related pages¶
This page covered the data path — how a write physically reaches a sector on disk. Related pages cover adjacent topics:
- Internals — the components that operate the data path: controller StatefulSet, node DaemonSet, reconciliation loop, custom resources
- Replication — MD RAID semantics, degraded operation, resync behavior
- Performance — overhead analysis of the data path described above
- Self-Healing — how the system responds to network partitions and node failures