
Architecture

This page describes how a write from a pod travels through MeshStor to a sector on disk, layer by layer. For the components that maintain that data path — the controller StatefulSet, the node DaemonSet, the reconciliation loop, and the custom resources — see Internals.

MeshStor delivers replicated block storage using only kernel subsystems: GPT partitions on local NVMe drives, NVMe-oF for network transport, and MD RAID for replication. Every byte travels from pod to disk through kernel code paths — no userspace proxies, no protocol translation, no custom replication engines.

The Data Path

```mermaid
flowchart TB
    POD["Pod"]
    XFS["XFS Filesystem"]
    MD["MD RAID1<br/>/dev/md0"]
    LP["Local Partition<br/>nvme0n1p3"]
    RP["Remote Partition<br/>nvme2n1"]
    LD[("Local NVMe Drive")]
    RD[("Remote NVMe Drive")]

    POD -->|"POSIX read/write"| XFS
    XFS -->|"block I/O"| MD
    MD -->|"mirror"| LP
    MD -->|"mirror"| RP
    LP --> LD
    RP -->|"NVMe-oF TCP/RDMA"| RD
```

Reads come from the local partition — sub-millisecond latency, identical to a non-replicated local volume. Writes go to both mirrors in parallel; the write completes when both acknowledge. The MD write-intent bitmap tracks which regions are dirty, so only changed blocks need resync after an interruption — not the entire volume.

Write amplification

A write to a numberOfCopies=N MeshStor volume becomes N backing-device writes, one per copy: one to the local partition and N − 1 over NVMe-oF to the remote copies. The arithmetic for an evaluator's mental model:

| Property | Formula | Notes |
|---|---|---|
| Write IOPS to backing devices | N × volume IOPS | Each replicated write hits N partitions |
| Write throughput required from each backing device | volume throughput | Per copy, not aggregated |
| Write latency observed by the pod | max(local, remote_1, …, remote_{N−1}) | Not a sum — copies are written in parallel |
| Read latency observed by the pod | local latency | Reads are served from the local partition |
| Network throughput on each remote link | volume write throughput | One inbound stream per copy hosted on that node |

For numberOfCopies=1, all of the multipliers collapse to 1 — the volume goes through the same data path as a replicated volume but only writes to one underlying partition. There is no remote NVMe-oF traffic in steady state.

For RAID10 (drivesPerCopy ≥ 2), each copy is itself striped across multiple drives, so the local IOPS multiplier becomes N × drivesPerCopy against N × drivesPerCopy underlying partitions. See Replication for the RAID-level detail.

Failure domains

Each layer of the data path protects against a specific failure class. The table below shows which layer catches what.

| Failure | Protected by | Surviving copies |
|---|---|---|
| Bad disk sector | XFS metadata + MD scrub | All — MD rewrites the bad sector from a good copy |
| Whole drive failure | MD RAID1 / RAID10 (degraded operation) | N − drives_lost; the volume continues serving reads and writes |
| Node failure | NVMe-oF host disconnect → MD degraded mode | N − 1; the copy on the failed node is gone, the remote copies remain |
| Network partition | NVMe-oF host disconnect → MD degraded mode | N − copies_on_partitioned_side |
| Simultaneous loss of all copies | Nothing | Data loss |

The number of copies (numberOfCopies) is the dominant variable. Two copies survive any single failure; three copies survive any two simultaneous failures. One copy (the numberOfCopies=1 mode) survives no failure of the underlying storage but does survive a soft eviction of the pod because MeshStor relocates the partition to another node — see the local-storage section of Comparison.
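The survivability rule is simple arithmetic: the volume keeps serving I/O as long as at least one copy survives. A toy sketch (not MeshStor code):

```python
# Toy model of the failure-domain table: a numberOfCopies=N volume
# stays available while fewer than N copies are simultaneously lost.

def volume_available(number_of_copies: int, copies_lost: int) -> bool:
    """True while at least one copy of the volume survives."""
    return copies_lost < number_of_copies

assert volume_available(2, 1)       # two copies survive any single failure
assert volume_available(3, 2)       # three copies survive two failures
assert not volume_available(1, 1)   # one copy survives no storage failure
assert not volume_available(3, 3)   # losing every copy is data loss
```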

For the timeline of how MD detects and responds to each failure class, see Self-Healing.

Storage Layer: GPT Partitions

Volumes are real GPT partitions on physical NVMe drives — not files, not loopback devices, not thin-provisioned images. Each volume gets a dedicated partition with a deterministic UUID derived from the volume name. Multiple volumes share a drive through the GPT partition table.

This means:

  • Direct block device I/O — no filesystem-on-filesystem overhead, no copy-on-write tax
  • Partition alignment to 1 MiB boundaries for optimal NVMe write performance
  • Clean tenancy — new partitions are zeroed and superblocks cleared to prevent stale metadata from previous users
  • XFS formatted with a per-volume UUID for consistent identification across reboots and node migrations
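A deterministic, name-derived UUID can be sketched with a namespaced UUIDv5. This is only an illustration of the property (same name, same UUID, everywhere); the namespace constant and prefix below are made up, and MeshStor's actual derivation scheme is not specified here:

```python
import uuid

# NAMESPACE and the "meshstor/" prefix are hypothetical; only the
# deterministic-derivation property itself comes from the doc.
NAMESPACE = uuid.NAMESPACE_DNS

def partition_uuid(volume_name: str) -> uuid.UUID:
    """Same volume name -> same partition UUID, on every node, every run."""
    return uuid.uuid5(NAMESPACE, f"meshstor/{volume_name}")

# Deterministic: recomputing yields the identical UUID.
assert partition_uuid("pvc-data-0") == partition_uuid("pvc-data-0")
# Distinct volumes get distinct partition UUIDs.
assert partition_uuid("pvc-data-0") != partition_uuid("pvc-data-1")
```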

Transport Layer: NVMe-oF

Remote partitions appear as native NVMe block devices on the consuming node. The kernel's NVMe-oF initiator handles all data I/O — MeshStor only configures the connection at setup time.

| Transport | When Used | Latency | CPU Overhead |
|---|---|---|---|
| RDMA | Both nodes share an RDMA subnet | Lowest (~1 µs network) | Near zero (kernel bypass) |
| TCP | Any IP network | Low (~10 µs network) | Minimal (kernel TCP stack) |

Transport is selected automatically per node pair. When both nodes advertise RDMA addresses in the same subnet, RDMA is used. Otherwise, TCP provides a capable fallback that works over any IP network without special hardware.

Connection parameters are tuned for fast failure detection:

| Parameter | Value | Purpose |
|---|---|---|
| Keep-alive | 1s | Detect hung controllers quickly |
| Fast I/O fail | 1s | Fail I/O up to MD for fast failover |
| Controller loss | 3s | Tear down the connection if unrecoverable |
| Reconnect delay | 1s | Retry the connection immediately |

Address configuration (TCP and optional RDMA annotations, required ports, and the subnet connectivity rule) lives in Prerequisites.

Replication Layer: MD RAID

Linux MD RAID mirrors writes across local and remote partitions. This is the same subsystem that has protected production Linux servers for decades — battle-tested, well-understood, and maintained by the kernel community.

| Copies | Drives per Copy | RAID Level | Behavior |
|---|---|---|---|
| 1 | 1 | RAID1 (2 slots) | 1 active + 1 placeholder for relocation |
| 2 | 1 | RAID1 | Mirror across 2 nodes |
| 3 | 1 | RAID1 | Mirror across 3 nodes |
| 2 | 2 | RAID10 | Striped mirrors across 2 nodes, 2 drives each |

Key design choices:

  • Write-intent bitmap — tracks dirty regions at block granularity. After a brief disconnection, only the changed blocks resync — not the entire volume. This turns a multi-hour rebuild into seconds or minutes.
  • Assume-clean creation — new arrays skip the initial full sync because all members start empty. The first real data write is the only write.
  • Local-first reads — MD preferentially reads from the local partition, keeping read latency identical to a non-replicated local NVMe volume.
  • Interleaved layout — for RAID10, partitions from different nodes alternate across mirror groups. No single node failure can take out an entire stripe.
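The interleaved layout can be sketched as a round-robin over nodes. In MD's default RAID10 "near" layout, consecutive members form mirror pairs, so alternating nodes in the member order puts each pair on different nodes. The function and device names below are illustrative, not MeshStor code:

```python
# Sketch of the interleaved RAID10 member ordering described above.
# Assumption: RAID10 "near" layout mirrors consecutive members.

def interleave(partitions_by_node: dict[str, list[str]]) -> list[str]:
    """Round-robin across nodes: a[0], b[0], a[1], b[1], ..."""
    nodes = sorted(partitions_by_node)
    depth = len(partitions_by_node[nodes[0]])
    return [partitions_by_node[n][i] for i in range(depth) for n in nodes]

members = interleave({
    "node-a": ["nvme0n1p3", "nvme1n1p3"],  # drivesPerCopy=2 on node-a
    "node-b": ["nvme2n1p3", "nvme3n1p3"],  # the mirror copy on node-b
})
# Each consecutive mirror pair spans both nodes, so losing one node
# never takes out an entire stripe.
assert members == ["nvme0n1p3", "nvme2n1p3", "nvme1n1p3", "nvme3n1p3"]
```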

See Replication for StorageClass configuration, degraded operation, and recovery behavior.

Self-Healing

MeshStor continuously monitors volume health through a reconciliation loop on every node. Recovery is automatic — no operator intervention required.

Automatic reconnection — if an NVMe-oF connection drops and recovers (transient network blip, node reboot), the kernel re-establishes the connection and MD resyncs only the dirty bitmap regions. For brief interruptions, resync completes in seconds.
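Why bitmap resync completes in seconds is back-of-the-envelope arithmetic. The 64 MiB chunk size and 1 GiB/s sync rate below are assumptions for illustration, not MeshStor defaults:

```python
# Toy model of bitmap-driven resync: only dirty chunks are copied back
# to the rejoining member. Chunk size and sync rate are assumed values.

def resync_seconds(dirty_chunks: int, chunk_bytes: int,
                   sync_bytes_per_sec: float) -> float:
    """Time to copy the bitmap-dirty regions to the rejoining member."""
    return dirty_chunks * chunk_bytes / sync_bytes_per_sec

MiB, GiB = 1024**2, 1024**3
# A brief blip dirtied 200 chunks of 64 MiB: resync takes seconds.
assert resync_seconds(200, 64 * MiB, 1 * GiB) == 12.5
# Without a bitmap, a 2 TiB volume (32,768 chunks) resyncs in full.
assert resync_seconds(32_768, 64 * MiB, 1 * GiB) == 2048.0
```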

Member replacement — if a partition stays unreachable beyond the configurable timeout (default 15 minutes, minimum 60 seconds), the reconciler selects the best available node and provisions a replacement partition. The new member syncs from the surviving copy, and the old member is cleaned up automatically.
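The replacement decision reduces to a timeout comparison. The default (15 minutes) and minimum (60 seconds) come from the text above; the function itself is a hypothetical sketch:

```python
# Sketch of the member-replacement decision described above.
DEFAULT_TIMEOUT_S = 15 * 60   # configurable, default 15 minutes
MIN_TIMEOUT_S = 60            # minimum 60 seconds

def should_replace_member(unreachable_s: float,
                          timeout_s: float = DEFAULT_TIMEOUT_S) -> bool:
    """Replace a partition that has been unreachable beyond the timeout."""
    timeout_s = max(timeout_s, MIN_TIMEOUT_S)  # clamp to the minimum
    return unreachable_s > timeout_s

assert not should_replace_member(120)           # still within 15 minutes
assert should_replace_member(20 * 60)           # past the default timeout
assert should_replace_member(90, timeout_s=30)  # 30 s clamps up to 60 s
```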

Drain migration — when a node is drained with kubectl drain, the volume migrates transparently to the new node. The old partition is imported via NVMe-oF, a new local partition is created, MD syncs the data, and the old partition is removed. The pod sees no data loss and minimal interruption.

See Self-Healing for failure recovery and Volume Relocation for drain migration details.

Node Placement

When selecting which nodes host volume partitions, MeshStor scores every candidate node:

| Factor | Priority | Rationale |
|---|---|---|
| RDMA connectivity | Highest | Lower latency, lower CPU overhead on the data path |
| Available free space | High | Distributes volumes evenly, avoids capacity hotspots |
| Network latency | Medium | Prefers topologically closer nodes |
| Fault isolation | Enforced | One partition per node per volume — no single-node SPOF |

The scoring runs at volume creation and again when replacing a failed member, so placement adapts to the current state of the cluster.
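An illustrative scoring pass matching the table: RDMA dominates, free space and latency break ties, and fault isolation is a hard filter rather than a weight. The weights and field names below are made up; only the ordering of factors follows the table:

```python
from dataclasses import dataclass

# Hypothetical node model and weights, for illustration only.
@dataclass
class Node:
    name: str
    has_rdma: bool
    free_bytes: int
    latency_ms: float
    hosts_this_volume: bool  # fault isolation: one partition per node

def score(n: Node) -> float:
    return (1000 if n.has_rdma else 0) \
         + n.free_bytes / 2**30 \
         - n.latency_ms * 10

def place(candidates: list[Node]) -> Node:
    # Fault isolation is enforced, not scored: nodes already hosting a
    # partition of this volume are filtered out entirely.
    eligible = [n for n in candidates if not n.hosts_this_volume]
    return max(eligible, key=score)

best = place([
    Node("a", True,  500 * 2**30, 0.2, False),
    Node("b", False, 900 * 2**30, 0.1, False),
    Node("c", True,  800 * 2**30, 0.3, True),  # already hosts a partition
])
assert best.name == "a"  # RDMA outweighs node b's extra free space
```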

This page covered the data path — how a write physically reaches a sector on disk. Related pages cover adjacent topics:

  • Internals — the components that operate the data path: controller StatefulSet, node DaemonSet, reconciliation loop, custom resources
  • Replication — MD RAID semantics, degraded operation, resync behavior
  • Performance — overhead analysis of the data path described above
  • Self-Healing — how the system responds to network partitions and node failures