Kernel Features

Linux kernel changes that materially affect MeshStor on the local-NVMe + NVMe-oF topology. The first group is upstream: present in mainline kernels from the listed version, and available automatically once a node reboots into such a kernel. The second group is out-of-tree work the MeshStor maintainers carry while upstream contribution is in flight; you only need it if you want the specific behavior it adds, and MeshStor still works correctly without it.

For supported distributions and the kernel floor, see Compatibility.

Mainline features

Lockless write-intent bitmap (llbitmap) — kernel 6.18+

Author: Yu Kuai (Huawei) — introducing patch series

MeshStor uses MD's internal write-intent bitmap so that a transient network failure or crash recovers via incremental resync instead of a full-disk rebuild. The classic bitmap takes a global spinlock on every dirty-bit update, which becomes the dominant write-path cost on fast NVMe — Yu Kuai's patch series cited "lock contention and huge IO performance degradation for all raid levels" as the motivation for the rewrite.
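To make the incremental-resync idea concrete, here is a minimal toy model of a write-intent bitmap. It is a sketch only: the class name, the 64 MiB chunk size, and the in-memory list are illustrative assumptions and do not reflect MD's on-disk format or locking.

```python
from __future__ import annotations


class WriteIntentBitmap:
    """Toy model (hypothetical): one dirty flag per fixed-size chunk of the array."""

    def __init__(self, array_bytes: int, chunk_bytes: int) -> None:
        self.chunk_bytes = chunk_bytes
        n_chunks = -(-array_bytes // chunk_bytes)  # ceiling division
        self.dirty = [False] * n_chunks

    def mark_write(self, offset: int, length: int) -> None:
        """Flag every chunk a write touches, before the write is issued."""
        first = offset // self.chunk_bytes
        last = (offset + length - 1) // self.chunk_bytes
        for chunk in range(first, last + 1):
            self.dirty[chunk] = True

    def mark_in_sync(self, chunk: int) -> None:
        """Clear the flag once all replicas hold the same data for this chunk."""
        self.dirty[chunk] = False

    def chunks_to_resync(self) -> list[int]:
        """After a failure, only these chunks need to be copied again."""
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


MiB = 1024 * 1024
bitmap = WriteIntentBitmap(array_bytes=1024 * MiB, chunk_bytes=64 * MiB)
bitmap.mark_write(offset=130 * MiB, length=4096)     # falls inside chunk 2
bitmap.mark_write(offset=191 * MiB, length=2 * MiB)  # spans chunks 2 and 3
pending = bitmap.chunks_to_resync()
print(f"resync {len(pending)} of {len(bitmap.dirty)} chunks: {pending}")
```

In this toy run, a replica that drops out after the two writes needs only 2 of the 16 chunks copied again rather than the whole device.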

llbitmap drops the lock from the I/O fast path and has been in mainline since 6.18. It is selected at array-create time via the bitmap_type sysfs attribute and is opt-in; the classic bitmap (bitmap=internal in mdadm) remains the default. MeshStor selects llbitmap whenever it is available (either the running kernel supports it or the custom MeshStor kernel module is loaded); otherwise it falls back to the classic bitmap.
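The selection rule can be sketched as a small decision function. This is not MeshStor's actual code: the function names, the return labels, and the use of a version check (rather than probing the kernel directly) are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Mainline version that first ships llbitmap, per the text above.
LLBITMAP_MIN_KERNEL = (6, 18)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.18.0-rc3'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def choose_bitmap(release: str, oot_module_loaded: bool) -> str:
    """Hypothetical helper: prefer llbitmap when available, else classic."""
    kernel_has_llbitmap = parse_kernel_release(release) >= LLBITMAP_MIN_KERNEL
    if kernel_has_llbitmap or oot_module_loaded:
        return "llbitmap"
    return "classic"


print(choose_bitmap("6.18.0-rc3", oot_module_loaded=False))
print(choose_bitmap("6.8.0-45-generic", oot_module_loaded=False))
print(choose_bitmap("6.8.0-45-generic", oot_module_loaded=True))
```

Comparing versions as integer tuples matters here: a plain string comparison would rank "6.8" above "6.18".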

Effect on MeshStor. On NVMe-oF arrays under sustained random write load, llbitmap is the single largest knob the write path has — the global spinlock no longer serializes writers, and bitmap I/O is integrated with the block layer instead of going through the page cache. On random-write workloads we measure up to 25% higher IOPS and up to 30% lower latency versus the classic bitmap.

Out-of-tree patches (not yet contributed)

Latency-aware read balance for RAID1 and RAID10

MeshStor reads from whichever replica is currently fastest. On a healthy 2-replica volume, the local NVMe drive (~10 µs) wins; reads only spill over to the remote replica (~30 µs over RDMA, ~100–200 µs over TCP) when the local queue is deep enough that going remote actually pays off.

Stock MD doesn't track per-disk latency on non-rotational arrays — it splits reads across replicas by queue depth alone, so a meaningful share of reads cross the wire even when the local replica is idle. With this patch, that traffic stays local and the pod sees local-NVMe latency on its read path. The patch keeps a per-replica moving average of read completion latency and uses it as a cost function alongside the existing queue-depth tiebreak; sequential-read short-circuits and WriteMostly handling are unchanged.
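The selection logic described above can be sketched as follows. The patch itself is kernel C and is not shown in this document, so everything here is an assumption: the cost formula (average latency multiplied by the number of requests a new read would wait behind), the smoothing factor, and all names.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    avg_latency_us: float  # moving average of read completion latency
    in_flight: int = 0     # reads currently queued on this replica

    def record_completion(self, latency_us: float, alpha: float = 0.125) -> None:
        """Fold one completed read into an exponentially weighted moving average."""
        self.avg_latency_us += alpha * (latency_us - self.avg_latency_us)

    def expected_cost_us(self) -> float:
        """Assumed cost model: a new read waits behind everything already queued."""
        return self.avg_latency_us * (self.in_flight + 1)


def pick_replica(replicas: list) -> Replica:
    """Lowest expected cost wins; queue depth breaks ties."""
    return min(replicas, key=lambda r: (r.expected_cost_us(), r.in_flight))


local = Replica("local-nvme", avg_latency_us=10.0)
remote = Replica("remote-rdma", avg_latency_us=30.0)

idle_choice = pick_replica([local, remote]).name  # cost 10 vs 30
local.in_flight = 3
busy_choice = pick_replica([local, remote]).name  # cost 40 vs 30
local.record_completion(18.0)                     # 10 + 0.125 * (18 - 10)

print(idle_choice, busy_choice, local.avg_latency_us)
```

With the example latencies from the text, the local drive keeps winning until about three reads are queued on it, which is the "deep enough that going remote actually pays off" behavior in miniature.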

Effect on MeshStor. On random-read workloads we measure 17–100% higher IOPS and 15–50% lower latency versus stock MD; the upper end of the range corresponds to topologies where the remote replica is much slower than local (e.g. nvme-tcp over a congested link), where avoiding any remote hop matters most.

In-place RAID1 → RAID10 reshape

Lets a replicaCount=N, stripeWidth=1 volume (RAID1) become replicaCount=N, stripeWidth>1 (RAID10) without copying any data. Under the specific RAID10 geometry MeshStor uses, the on-disk byte layout is provably identical to RAID1 — every byte lives at the same physical offset on every disk — so the kernel just relabels the array and no resync runs. Stock MD requires a full reshape.
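The text does not say which RAID10 geometry MeshStor uses, so the following is only an illustration of how such a layout identity can be checked, not a description of the patch. It assumes md's "near" layout and looks only at the case where the number of copies equals the number of devices; chunk indices stand in for byte offsets, and superblock and data-offset details are ignored.

```python
from __future__ import annotations


def near_layout_placement(chunk: int, raid_disks: int, near_copies: int) -> list[tuple[int, int]]:
    """(device, chunk row on that device) for each copy of a logical chunk.

    Models the md RAID10 "near" layout: copies of a chunk occupy
    consecutive slots, and slots fill the devices row by row.
    """
    placements = []
    for copy in range(near_copies):
        slot = chunk * near_copies + copy
        placements.append((slot % raid_disks, slot // raid_disks))
    return placements


def raid1_placement(chunk: int, raid_disks: int) -> list[tuple[int, int]]:
    """RAID1 mirrors every chunk at the same row on every device."""
    return [(device, chunk) for device in range(raid_disks)]


# Assumed case: as many copies as devices. Every chunk lands where RAID1 puts it.
identical = all(
    near_layout_placement(c, raid_disks=3, near_copies=3) == raid1_placement(c, raid_disks=3)
    for c in range(1000)
)
print("near layout with copies == devices matches RAID1:", identical)

# For contrast, a conventional 4-device, 2-copy near layout does stripe.
print(near_layout_placement(1, raid_disks=4, near_copies=2))
```

When the placements agree for every chunk, converting between the two personalities changes only how the array is described, not where any data lives, which is why no resync would be needed in that case.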

This patch is the kernel-side dependency for the RAID1→RAID10 reshape feature, which is available to paid customers.

Per-region resync barriers in RAID10

When an MD RAID10 array is resyncing — common during member replacement and drain migration — stock MD blocks every concurrent application read and write across the entire array until the resync passes. This patch ports RAID1's existing per-region (64 MiB) barrier mechanism to RAID10, so only the region currently being resynced blocks I/O; everywhere else, the application sees normal latency.
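The difference between an array-wide barrier and a per-region barrier can be sketched with a little sector arithmetic. This is a simplified model with hypothetical function names, assuming 512-byte sectors and the 64 MiB region size mentioned above; the in-kernel mechanism is more involved than a single comparison.

```python
SECTOR_BYTES = 512
REGION_BYTES = 64 * 1024 * 1024
REGION_SECTORS = REGION_BYTES // SECTOR_BYTES  # sectors per 64 MiB region


def region_of(sector: int) -> int:
    """Index of the 64 MiB region that contains this sector."""
    return sector // REGION_SECTORS


def io_must_wait(io_sector: int, io_sectors: int, resync_sector: int, per_region: bool) -> bool:
    """Does an application I/O have to wait for the resync that is in progress?"""
    if not per_region:
        # Array-wide barrier: every I/O waits while resync is active.
        return True
    first = region_of(io_sector)
    last = region_of(io_sector + io_sectors - 1)
    # Per-region barrier: wait only if the I/O overlaps the region being resynced.
    return first <= region_of(resync_sector) <= last


resync_at = 5 * REGION_SECTORS + 1000  # resync is working inside region 5

print(io_must_wait(0, 8, resync_at, per_region=False))                         # array-wide
print(io_must_wait(0, 8, resync_at, per_region=True))                          # region 0
print(io_must_wait(5 * REGION_SECTORS + 64, 8, resync_at, per_region=True))    # region 5
```

In this model, an application I/O in region 0 proceeds immediately under per-region barriers while the resync is busy in region 5, whereas the array-wide barrier stalls it regardless of where it lands.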

What's Next

  • Compatibility — supported distributions and stock kernel versions
  • Architecture — where the latency-aware read selection shows up in the data path
  • Project Status — roadmap items that depend on these patches