Kernel Features¶
Linux kernel changes that materially affect MeshStor on the local-NVMe + NVMe-oF topology. The first group is upstream — present in mainline kernels at the listed version. Some are automatic on upgrade; others are opt-in per-array. The second group is out-of-tree work the MeshStor maintainers carry while upstream contribution is in flight; you only need it if you want the specific behavior it adds, and MeshStor still works correctly without it.
For supported distributions and the kernel floor, see Compatibility.
Mainline features¶
Lockless write-intent bitmap (llbitmap) — kernel 6.18+¶
Author: Yu Kuai (Huawei) — introductory patch series
MeshStor uses MD's internal write-intent bitmap so that a transient network failure or crash recovers via incremental resync instead of a full-disk rebuild. The classic bitmap takes a global spinlock on every dirty-bit update, which becomes the dominant write-path cost on fast NVMe — Yu Kuai's patch series cited "lock contention and huge IO performance degradation for all raid levels" as the motivation for the rewrite.
llbitmap drops the lock from the I/O fast path and is mainline since 6.18. It is selected at array-create time via the bitmap_type sysfs attribute and is opt-in — the classic bitmap (bitmap=internal in mdadm) remains the default. MeshStor will start opting in once the selector is exposed in stable mdadm releases on its supported distributions; until then, expect the classic bitmap.
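A minimal sketch of the opt-in, assuming the selector is exposed at /sys/block/&lt;md&gt;/md/bitmap_type and accepts the literal string llbitmap (both the exact path and the accepted value are assumptions here; check your kernel's md documentation before relying on either):

```python
# Sketch: select llbitmap for an MD array at create time via sysfs.
# ASSUMPTIONS: the attribute path and the accepted value "llbitmap" are
# inferred from the attribute name above, not verified against a 6.18 tree.
from pathlib import Path

def set_bitmap_type(md_dev: str, bitmap_type: str = "llbitmap") -> None:
    """Write the bitmap selector for an array, e.g. md_dev='md0'. Needs root."""
    attr = Path("/sys/block") / md_dev / "md" / "bitmap_type"
    if not attr.exists():
        raise RuntimeError(f"{attr} missing: kernel < 6.18 or llbitmap not built")
    attr.write_text(bitmap_type)

if __name__ == "__main__":
    set_bitmap_type("md0")  # without this, the classic bitmap stays the default
```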
Effect on MeshStor. On NVMe-oF arrays under sustained random write load, llbitmap is the single largest knob the write path has — the global spinlock no longer serializes writers, and bitmap I/O is integrated with the block layer instead of going through the page cache.
NVMe-oF + MD bitmap data-transfer fix — kernel 6.11+¶
Author: Ofir Gal (Volumez) — md/md-bitmap: fix writing non bitmap pages (commit ab99a87)
A pre-6.11 bug in __write_sb_page() rounded the bitmap I/O size up to the device's optimal I/O size without bounding it by the actual bitmap allocation. On nvme-tcp the resulting non-bitmap pages tripped the sendpage_ok() check and stopped the data transfer, hanging mdadm --create and writing garbage past the bitmap region — the exact topology MeshStor uses. Tracking issue: LP#2075110.
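For intuition, here is a toy model of that size calculation (illustrative Python, not the kernel code; the page size, optimal I/O size, and bitmap size are example values):

```python
# Toy model of the pre-6.11 __write_sb_page() rounding bug. All constants
# are examples chosen to make the overrun visible.
PAGE_SIZE = 4096

def io_size_pre_6_11(size: int, opt_io_size: int) -> int:
    # Round the bitmap write up to the device's optimal I/O size, unbounded.
    return ((size + opt_io_size - 1) // opt_io_size) * opt_io_size

def io_size_fixed(size: int, opt_io_size: int, bitmap_bytes: int) -> int:
    # Conceptual post-fix behavior: never extend past the bitmap allocation.
    return min(io_size_pre_6_11(size, opt_io_size), bitmap_bytes)

bitmap_bytes = 3 * PAGE_SIZE  # a 3-page bitmap
opt_io = 64 * 1024            # device advertises 64 KiB optimal I/O

print(io_size_pre_6_11(bitmap_bytes, opt_io))             # 65536: 13 pages too far
print(io_size_fixed(bitmap_bytes, opt_io, bitmap_bytes))  # 12288: bounded
```

The thirteen surplus pages in the unbounded case are the "non bitmap pages" that failed the sendpage_ok() check on nvme-tcp.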
Effect on MeshStor. This is the reason the kernel floor is 6.11+. On a stock 6.11 or newer kernel the fix is already in and no action is required. The one thing to watch for is distributions that selectively backport fixes onto older base kernels; on those, verify that the patched build actually carries the fix.
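A quick sanity check of the floor on a stock kernel (this only parses the version string, so it cannot vouch for selective backports):

```python
# Check whether the running kernel's version string meets the 6.11 floor.
import platform
import re

def meets_floor(release: str, floor=(6, 11)) -> bool:
    m = re.match(r"(\d+)\.(\d+)", release)
    return bool(m) and (int(m.group(1)), int(m.group(2))) >= floor

print(platform.release(), "ok" if meets_floor(platform.release()) else "check backports")
```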
Out-of-tree patches (not yet contributed)¶
Latency-aware read balance for RAID1 and RAID10¶
MeshStor reads from whichever replica is currently fastest. On a healthy 2-replica volume, the local NVMe drive (~10 µs) wins; reads only spill over to the remote replica (~30 µs over RDMA, ~100–200 µs over TCP) when the local queue is deep enough that going remote actually pays off.
Stock MD doesn't track per-disk latency on non-rotational arrays; it splits reads across replicas by queue depth alone, so a meaningful share of reads crosses the wire even when the local replica is idle. With this patch, that traffic stays local and the pod sees local-NVMe latency on its read path. The patch keeps a per-replica moving average of read completion latency and uses it as a cost function alongside the existing queue-depth tiebreak; sequential-read short-circuits and WriteMostly handling are unchanged.
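A user-space sketch of the idea (illustrative only, not the kernel patch; the EWMA weight and field names are invented here):

```python
# Model of latency-aware replica choice: EWMA of completion latency as a cost
# function, scaled by queue depth. Stock MD's non-rotational path considers
# only the queue-depth term.
from dataclasses import dataclass

EWMA_WEIGHT = 0.25  # weight of the newest completion sample (illustrative)

@dataclass
class Replica:
    name: str
    inflight: int = 0          # current queue depth
    ewma_lat_us: float = 0.0   # moving average of read completion latency

    def record_completion(self, latency_us: float) -> None:
        self.inflight -= 1
        self.ewma_lat_us += EWMA_WEIGHT * (latency_us - self.ewma_lat_us)

def choose_replica(replicas):
    # Cost ~ expected wait: observed latency scaled by queued work. Going
    # remote pays off only once the local queue is deep enough.
    return min(replicas, key=lambda r: (r.inflight + 1) * r.ewma_lat_us)

local = Replica("local-nvme", inflight=2, ewma_lat_us=10.0)
remote = Replica("nvme-of-tcp", inflight=0, ewma_lat_us=150.0)
print(choose_replica([local, remote]).name)  # local-nvme: 3*10 < 1*150
```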
In-place RAID1 → RAID10 reshape¶
Lets a replicaCount=N, stripeWidth=1 volume (RAID1) become replicaCount=N, stripeWidth>1 (RAID10) without copying any data. Under the specific RAID10 geometry MeshStor uses, the on-disk byte layout is provably identical to RAID1 — every byte lives at the same physical offset on every disk — so the kernel just relabels the array and no resync runs. Stock MD requires a full reshape.
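For intuition, the simplest instance of such an identity is RAID10's near layout with the copy count equal to the disk count: every chunk lands on every disk at the same offset, which is byte-for-byte the RAID1 placement. A toy placement model (illustrative; MeshStor's exact geometry is not spelled out on this page):

```python
# Compare chunk placement under RAID1 and RAID10 "near" when near == ndisks.
CHUNK = 512 * 1024  # example chunk size in bytes

def raid1_placement(chunk_idx: int, ndisks: int):
    # RAID1: every chunk mirrored to every disk at the same offset.
    return {disk: chunk_idx * CHUNK for disk in range(ndisks)}

def raid10_near_placement(chunk_idx: int, ndisks: int, near: int):
    # RAID10 "near": `near` consecutive copies per chunk, columns striped
    # round-robin across the disks.
    columns = ndisks // near                 # 1 when near == ndisks
    column = chunk_idx % columns
    offset = (chunk_idx // columns) * CHUNK
    return {(column * near + c) % ndisks: offset for c in range(near)}

for i in range(8):
    assert raid1_placement(i, 2) == raid10_near_placement(i, 2, near=2)
print("identical placement: relabeling is safe, no resync needed")
```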
This patch is the kernel-side dependency for the paid-tier RAID1 → RAID10 reshape feature.
Per-region resync barriers in RAID10¶
When an MD RAID10 array is resyncing (common during member replacement and drain migration), stock MD blocks every concurrent application read and write across the entire array until the resync completes. This patch ports RAID1's existing per-region (64 MiB) barrier mechanism to RAID10, so only the region currently being resynced blocks I/O; everywhere else, the application sees normal latency.
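A conceptual model of the mechanism (user-space Python sketch, not the kernel code; the 64 MiB region size is the granularity cited above):

```python
# Per-region resync barrier model: only I/O that lands in the region
# currently being resynced has to wait.
import threading

REGION_BYTES = 64 * 1024 * 1024

class RegionBarriers:
    def __init__(self):
        self._lock = threading.Lock()
        self._cond = threading.Condition(self._lock)
        self._resync_region = None  # region index under resync, or None

    def begin_resync(self, offset: int) -> None:
        with self._lock:
            self._resync_region = offset // REGION_BYTES

    def end_resync(self) -> None:
        with self._lock:
            self._resync_region = None
            self._cond.notify_all()

    def wait_for_io(self, offset: int) -> None:
        # Stock RAID10 would make all I/O wait while resync is active; here
        # only the active region blocks.
        region = offset // REGION_BYTES
        with self._lock:
            while self._resync_region == region:
                self._cond.wait()

barriers = RegionBarriers()
barriers.begin_resync(0)                 # resync sweeping region 0
barriers.wait_for_io(128 * 1024 * 1024)  # region 2: returns immediately
print("I/O outside the active region never blocks")
```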
What's Next¶
- Compatibility — supported distributions and stock kernel versions
- Architecture — where the latency-aware read selection shows up in the data path
- Project Status — roadmap items that depend on these patches