Architecture¶
This page describes how a write from a pod travels through MeshStor to a sector on disk, layer by layer. For the components that maintain that data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources — see Internals.
MeshStor delivers replicated block storage using only kernel subsystems: GPT partitions on local NVMe drives, NVMe-oF for network transport, and MD RAID for replication. Every byte travels from pod to disk through kernel code paths — no userspace proxies, no protocol translation, no custom replication engines.
The Data Path¶
flowchart TB
POD["Pod"]
XFS["XFS Filesystem"]
MD["MD RAID1<br/>/dev/md0"]
LP["Local Partition<br/>nvme0n1p3"]
RP["Remote Partition<br/>nvme2n1"]
LD[("Local NVMe Drive")]
RD[("Remote NVMe Drive")]
POD -->|"POSIX read/write"| XFS
XFS -->|"block I/O"| MD
MD -->|"mirror"| LP
MD -->|"mirror"| RP
LP --> LD
RP -->|"NVMe-oF TCP/RDMA"| RD
Reads come from the local partition — sub-millisecond latency, identical to a non-replicated local volume. Writes go to both mirrors in parallel; the write completes when both acknowledge. The MD write-intent bitmap tracks which regions are dirty, so only changed blocks need resync after an interruption — not the entire volume.
Write amplification¶
A write to a numberOfCopies=N MeshStor volume becomes N NVMe-oF writes — one to each copy. The arithmetic behind that mental model:
| Property | Formula | Notes |
|---|---|---|
| Write IOPS to backing devices | N × volume IOPS | Each replicated write hits N partitions |
| Write throughput required from each backing device | volume throughput | Per copy, not aggregated |
| Write latency observed by the pod | max(local, remote_1, …, remote_{N-1}) | Not a sum — copies are written in parallel |
| Read latency observed by the pod | local latency | Reads are served from the local partition |
| Network throughput on each remote link | volume write throughput | One inbound stream per copy hosted on that node |
For numberOfCopies=1, all of the multipliers collapse to 1 — the volume goes through the same data path as a replicated volume but only writes to one underlying partition. There is no remote NVMe-oF traffic in steady state.
For RAID10 (drivesPerCopy ≥ 2), each copy is itself striped across multiple drives, so the local IOPS multiplier becomes N × drivesPerCopy against N × drivesPerCopy underlying partitions. See Replication for the RAID-level detail.
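The table and the RAID10 rule collapse into a few lines of arithmetic. The following is a standalone model for sizing, not MeshStor code; the function name and return keys are illustrative:

```python
def write_amplification(volume_iops, volume_tput_mib, copy_latencies_ms,
                        number_of_copies=2, drives_per_copy=1):
    """Model the backing-device load for one MeshStor volume.

    copy_latencies_ms: per-copy write latency in ms, local copy first.
    """
    assert len(copy_latencies_ms) == number_of_copies
    partitions = number_of_copies * drives_per_copy
    return {
        # Each replicated write hits every copy; striping raises the multiplier.
        "backing_write_iops": volume_iops * partitions,
        # Each backing device must sustain the full volume throughput (per copy).
        "per_device_tput_mib": volume_tput_mib,
        # Copies are written in parallel: pod latency is the slowest mirror.
        "pod_write_latency_ms": max(copy_latencies_ms),
        # Reads are served from the local partition.
        "pod_read_latency_ms": copy_latencies_ms[0],
        "underlying_partitions": partitions,
    }
```

For numberOfCopies=1 every multiplier evaluates to 1, matching the steady-state claim above.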
Failure domains¶
Each layer of the data path protects against a specific failure class. The table below shows which layer catches what.
| Failure | Protected by | Surviving copies |
|---|---|---|
| Bad disk sector | XFS metadata + MD scrub | All — MD rewrites the bad sector from a good copy |
| Whole drive failure | MD RAID1 / RAID10 (degraded operation) | N − drives_lost; the volume continues serving reads and writes |
| Node failure | NVMe-oF host disconnect → MD degraded mode | N − 1; the local copy on the failed node is gone, the remote copies remain |
| Network partition | NVMe-oF host disconnect → MD degraded mode | N − copies_on_partitioned_side |
| Simultaneous loss of all copies | Nothing | Data loss |
The number of copies (numberOfCopies) is the dominant variable. Two copies survive any single failure; three copies survive any two simultaneous failures. One copy (the numberOfCopies=1 mode) survives no failure of the underlying storage but does survive a soft eviction of the pod because MeshStor relocates the partition to another node — see the local-storage section of Comparison.
For the timeline of how MD detects and responds to each failure class, see Self-Healing.
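The survivability rule reduces to simple subtraction. This sketch assumes each failed node or drive takes out at most one copy, which the one-partition-per-node placement guarantees:

```python
def surviving_copies(number_of_copies, simultaneous_failures):
    """Copies left after losing nodes/drives, one copy per failure."""
    return max(number_of_copies - simultaneous_failures, 0)

def volume_available(number_of_copies, simultaneous_failures):
    # The volume keeps serving I/O as long as at least one copy survives.
    return surviving_copies(number_of_copies, simultaneous_failures) >= 1
```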
Storage Layer: GPT Partitions¶
Volumes are real GPT partitions on physical NVMe drives — not files, not loopback devices, not thin-provisioned images. Each volume gets a dedicated partition with a deterministic UUID derived from the volume name. Multiple volumes share a drive through the GPT partition table.
This means:
- Direct block device I/O — no filesystem-on-filesystem overhead, no copy-on-write tax
- Partition alignment to 1 MiB boundaries for optimal NVMe write performance
- Clean tenancy — new partitions are zeroed and superblocks cleared to prevent stale metadata from previous users
- XFS formatted with a per-volume UUID for consistent identification across reboots and node migrations
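A deterministic, name-derived identifier can be sketched with RFC 4122 name-based (version 5) UUIDs. The namespace string here is a hypothetical placeholder — MeshStor's actual derivation scheme may differ:

```python
import uuid

# Hypothetical namespace; MeshStor's real derivation may use a different one.
MESHSTOR_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "meshstor.example.com")

def partition_uuid(volume_name: str) -> uuid.UUID:
    """Same volume name -> same partition UUID, on every node, every time."""
    return uuid.uuid5(MESHSTOR_NS, volume_name)
```

Determinism is the point: a node can re-identify its partitions after a reboot or migration from the GPT table alone, without any local state.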
Transport Layer: NVMe-oF¶
Remote partitions appear as native NVMe block devices on the consuming node. The kernel's NVMe-oF initiator handles all data I/O — MeshStor only configures the connection at setup time.
| Transport | When Used | Latency | CPU Overhead |
|---|---|---|---|
| RDMA | Both nodes share an RDMA subnet | Lowest (~1 µs network) | Near zero (kernel bypass) |
| TCP | Any IP network | Low (~10 µs network) | Minimal (kernel TCP stack) |
Transport is selected automatically per node pair. When both nodes advertise RDMA addresses in the same subnet, RDMA is used. Otherwise, TCP provides a capable fallback that works over any IP network without special hardware.
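The selection rule fits in a few lines. The address format ("addr/prefix" strings, `None` when a node has no RDMA NIC) is an assumption for illustration, not MeshStor's wire format:

```python
import ipaddress

def select_transport(local_rdma, remote_rdma):
    """Pick the NVMe-oF transport for a node pair.

    local_rdma / remote_rdma: advertised RDMA address as 'addr/prefix',
    or None if the node advertises no RDMA address.
    """
    if local_rdma and remote_rdma:
        a = ipaddress.ip_interface(local_rdma)
        b = ipaddress.ip_interface(remote_rdma)
        if b.ip in a.network:       # same RDMA subnet -> kernel-bypass path
            return "rdma"
    return "tcp"                    # fallback that works on any IP network
```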
Connection parameters are tuned for fast failure detection:
| Parameter | Value | Purpose |
|---|---|---|
| Keep-alive | 1s | Detect hung controllers quickly |
| Fast I/O fail | 1s | Fail I/O to MD for fast failover |
| Controller loss | 3s | Tear down connection if unrecoverable |
| Reconnect delay | 1s | Retry connection immediately |
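Assembled into an nvme-cli invocation, the tuning would look roughly like the sketch below. The NQN and address are placeholders, and the flag spellings follow nvme-cli conventions but should be verified against your installed version:

```python
def nvme_connect_cmd(transport, addr, nqn):
    """Assemble an `nvme connect` argv with the failure-detection tuning above.

    Illustrative only: flag names may vary across nvme-cli versions.
    """
    return [
        "nvme", "connect",
        "--transport", transport,      # "tcp" or "rdma"
        "--traddr", addr,
        "--trsvcid", "4420",           # standard NVMe-oF port
        "--nqn", nqn,
        "--keep-alive-tmo", "1",       # detect hung controllers in 1 s
        "--reconnect-delay", "1",      # retry the connection immediately
        "--ctrl-loss-tmo", "3",        # tear down if unrecoverable after 3 s
    ]
```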
Address configuration (TCP and optional RDMA annotations, required ports, and the subnet connectivity rule) lives in Prerequisites.
Replication Layer: MD RAID¶
Linux MD RAID mirrors writes across local and remote partitions. This is the same subsystem that has protected production Linux servers for decades — battle-tested, well-understood, and maintained by the kernel community.
| Copies | Drives | RAID Level | Behavior |
|---|---|---|---|
| 1 | 1 | RAID1 (2 slots) | 1 active + 1 placeholder for relocation |
| 2 | 1 | RAID1 | Mirror across 2 nodes |
| 3 | 1 | RAID1 | Mirror across 3 nodes |
| 2 | 2 | RAID10 | Striped mirrors across 2 nodes, 2 drives each |
Key design choices:
- Write-intent bitmap — tracks dirty regions at block granularity. After a brief disconnection, only the changed blocks resync — not the entire volume. This turns a multi-hour rebuild into seconds or minutes.
- Assume-clean creation — new arrays skip the initial full sync because all members start empty. The first real data write is the only write.
- Local-first reads — MD preferentially reads from the local partition, keeping read latency identical to a non-replicated local NVMe volume.
- Interleaved layout — for RAID10, partitions from different nodes alternate across mirror groups. No single node failure can take out an entire stripe.
See Replication for StorageClass configuration, degraded operation, and recovery behavior.
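One way to realize the interleaved layout, assuming MD RAID10's default "near" layout where adjacent member slots form a mirror group (a sketch — MeshStor's actual slot ordering may differ):

```python
def interleave(copies):
    """Order member partitions for a RAID10 'near' layout.

    copies: one list of partitions per node, e.g.
            [["A-p0", "A-p1"], ["B-p0", "B-p1"]]
    Adjacent slots form mirror groups, so alternating nodes per slot
    guarantees every mirror group spans all nodes.
    """
    drives_per_copy = len(copies[0])
    slots = []
    for d in range(drives_per_copy):
        for node_partitions in copies:
            slots.append(node_partitions[d])
    return slots
```

With two nodes and two drives each, the result is `A-p0, B-p0, A-p1, B-p1`: mirror group 0 is (A-p0, B-p0) and group 1 is (A-p1, B-p1), so losing either node leaves one member of every group intact.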
Self-Healing¶
MeshStor continuously monitors volume health through a reconciliation loop on every node. Recovery is automatic — no operator intervention required.
Automatic reconnection — if an NVMe-oF connection drops and recovers (transient network blip, node reboot), the kernel re-establishes the connection and MD resyncs only the dirty bitmap regions. For brief interruptions, resync completes in seconds.
Member replacement — if a partition stays unreachable beyond the configurable timeout (default 15 minutes, minimum 60 seconds), the reconciler selects the best available node and provisions a replacement partition. The new member syncs from the surviving copy, and the old member is cleaned up automatically.
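The replacement decision reduces to a clamped timeout check. This is an illustrative reduction — the real reconciler tracks per-member state rather than a single duration:

```python
DEFAULT_TIMEOUT_S = 15 * 60   # default member-replacement timeout
MIN_TIMEOUT_S = 60            # enforced minimum

def should_replace(unreachable_for_s, configured_timeout_s=DEFAULT_TIMEOUT_S):
    """True once a member has been unreachable longer than the timeout."""
    timeout = max(configured_timeout_s, MIN_TIMEOUT_S)  # clamp to the floor
    return unreachable_for_s > timeout
```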
Drain migration — when a node is drained with kubectl drain, the volume migrates transparently to another node. The old partition is imported via NVMe-oF, a new local partition is created, MD syncs the data, and the old partition is removed. The pod sees no data loss and minimal interruption.
See Self-Healing for failure recovery and Volume Relocation for drain migration details.
Node Placement¶
When selecting which nodes host volume partitions, MeshStor scores every candidate node:
| Factor | Priority | Rationale |
|---|---|---|
| RDMA connectivity | Highest | Lower latency, lower CPU overhead on the data path |
| Available free space | High | Distributes volumes evenly, avoids capacity hotspots |
| Network latency | Medium | Prefers topologically closer nodes |
| Fault isolation | Enforced | One partition per node per volume — no single-node SPOF |
Scoring runs at volume creation and again when a failed member is replaced, so placement adapts to the current state of the cluster.
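A sketch of the scoring, with illustrative weights and field names (the actual weights are not documented here); fault isolation is modeled as a hard filter rather than a score term, matching its "Enforced" priority:

```python
def score_candidates(candidates, used_nodes):
    """Rank candidate nodes for a new or replacement partition.

    candidates: dicts with hypothetical fields
                name, has_rdma (bool), free_bytes, latency_ms.
    used_nodes: nodes already hosting a partition of this volume.
    """
    # Fault isolation is a hard constraint: one partition per node per volume.
    eligible = [c for c in candidates if c["name"] not in used_nodes]

    def score(c):
        return (
            (1000 if c["has_rdma"] else 0)   # highest: RDMA connectivity
            + c["free_bytes"] / 2**30        # high: GiB of free space
            - c["latency_ms"] * 10           # medium: prefer closer nodes
        )

    return sorted(eligible, key=score, reverse=True)
```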
What you'll find on related pages¶
This page covered the data path — how a write physically reaches a sector on disk. Related pages cover adjacent topics:
- Internals — the components that operate the data path: controller StatefulSet, node DaemonSet, reconciliation loop, custom resources
- Replication — MD RAID semantics, degraded operation, resync behavior
- Performance — overhead analysis of the data path described above
- Self-Healing — how the system responds to network partitions and node failures