Custom Kernel Patches

MeshStor maintains a small set of out-of-tree patches to the Linux MD driver. The driver runs on a stock kernel — these patches are not required for correctness — but each one closes a specific performance or operability gap that matters for the local-NVMe + NVMe-oF topology MeshStor uses. The patches are tracked separately on the upstream submission path; this page summarises what each one does and what changes for MeshStor when it is loaded.

For supported distributions and stock kernel versions, see Compatibility.

Latency-aware read balance (md-latency-ewma)

Linux branch: wip/md-raid1-raid10-latency
Files: drivers/md/md.h, drivers/md/raid1.c, drivers/md/raid10.c, drivers/md/md.c

Adds a per-rdev exponentially weighted moving average of read completion latency (latency_ewma_ns) and uses it as the cost function for read selection in choose_best_rdev (RAID1) and the equivalent path in RAID10:

cost = latency_ewma_ns × (nr_pending + 1)

The disk with the lowest cost wins. When candidate costs are within 12.5% of each other, selection falls back to the existing closest-distance heuristic, keeping symmetric mirrors stable. Sample latencies are clamped at 10 ms before blending. The EWMA uses α = 1/16 (shift 4), and the first completed read on each rdev seeds the EWMA directly, so the cost function works from the very first I/O with no convergence window. The sequential-read short-circuit, WriteMostly handling, and resync round-robin paths are unchanged. Per-rdev latency_ewma_ns is exposed via sysfs for debugging.
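The update rule can be sketched as a small userspace model. This is illustrative code mirroring the description above (α = 1/16 via shift 4, 10 ms sample clamp, first-sample seeding), not the patch's actual kernel source; names and the exact blending expression are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define EWMA_SHIFT      4               /* alpha = 1/16 */
#define SAMPLE_CLAMP_NS 10000000ULL     /* 10 ms ceiling on samples */

struct rdev_model {
    uint64_t latency_ewma_ns;           /* 0 until the first read completes */
    int nr_pending;
};

/* Blend a completed read's latency into the per-rdev EWMA. */
static void ewma_update(struct rdev_model *r, uint64_t sample_ns)
{
    if (sample_ns > SAMPLE_CLAMP_NS)
        sample_ns = SAMPLE_CLAMP_NS;
    if (r->latency_ewma_ns == 0)        /* seed directly: no convergence window */
        r->latency_ewma_ns = sample_ns;
    else
        r->latency_ewma_ns = r->latency_ewma_ns
                           - (r->latency_ewma_ns >> EWMA_SHIFT)
                           + (sample_ns >> EWMA_SHIFT);
}

/* The read-selection cost: latency_ewma_ns * (nr_pending + 1). */
static uint64_t cost(const struct rdev_model *r)
{
    return r->latency_ewma_ns * (uint64_t)(r->nr_pending + 1);
}
```

Seeding means the second sample already blends at the steady-state weight: after samples of 10 µs and 26 µs the EWMA sits at 11 µs, one sixteenth of the way toward the new sample.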

Effect on MeshStor. With local NVMe (~10 µs) and remote NVMe-oF (~30 µs over RDMA, ~100–200 µs over TCP), the cost ratio directs steady-state reads to the local replica and overflows to a remote one only when the local queue grows deep enough that (nr_pending + 1) × local_latency exceeds the remote cost. Without this patch, MD's stock heuristic for non-rotational arrays balances reads by nr_pending alone and splits read traffic roughly evenly across replicas regardless of fabric, making remote NVMe-oF latency directly visible to the pod.
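The crossover point can be made concrete with the latency figures above. The cost() helper below simply restates the patch's formula for a worked example, assuming an idle remote RDMA replica (30 µs, nr_pending = 0); it is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* cost = latency_ewma_ns * (nr_pending + 1), as in the patch's selection rule. */
static uint64_t cost(uint64_t latency_ewma_ns, int nr_pending)
{
    return latency_ewma_ns * (uint64_t)(nr_pending + 1);
}
```

With a 10 µs local EWMA, the local replica wins at queue depths 0 and 1, ties an idle 30 µs remote at depth 2, and loses at depth 3 — which is exactly when reads start spilling onto the fabric.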

In-place RAID1 → RAID10 takeover (md-raid1-to-raid10-takeover)

Linux branch: wip/md-raid1-to-raid10-takeover
Files: drivers/md/raid10.c, drivers/md/md.c

Adds raid10_takeover_raid1() — a zero-copy personality swap from a healthy N-disk RAID1 to RAID10 with a near_copies = N layout. For near_copies == raid_disks the on-disk byte layout under RAID1 and RAID10-near is provably identical: every byte lives at the same physical offset on every disk. Under that invariant no data moves and no resync runs — the takeover is a pure personality swap committed via level_store (writing "raid10" to /sys/block/mdX/md/level).
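The layout identity follows directly from the near-layout placement rule. The sketch below models the simplified RAID10 "near" chunk mapping (far/offset layouts ignored; this is a model of the geometry, not drivers/md/raid10.c itself): copy k of logical chunk c lands on disk (c·near_copies + k) mod raid_disks at physical chunk (c·near_copies + k) / raid_disks.

```c
#include <assert.h>

struct placement {
    int disk;
    int phys_chunk;
};

/* Simplified RAID10 "near" placement of copy k of logical chunk c. */
static struct placement near_map(int c, int k, int near_copies, int raid_disks)
{
    struct placement p;
    p.disk       = (c * near_copies + k) % raid_disks;
    p.phys_chunk = (c * near_copies + k) / raid_disks;
    return p;
}
```

With near_copies == raid_disks, copy k of chunk c always lands on disk k at physical chunk c — every disk is a full linear image of the array, which is exactly the RAID1 layout, and why the takeover can skip data movement entirely.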

Strict preconditions are enforced before any state mutation — v1.x metadata, no in-flight reshape, no external_size, at least two disks, not degraded, no WriteMostly member. Each failed precondition emits a distinct pr_warn so operators can identify which check fired without a debugger.
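The shape of that gate — one distinct warning per failed check, before any state is touched — can be sketched as below. All struct fields, the function name, and the warning strings here are invented for illustration; the patch's actual identifiers and messages differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct array_state {
    int  metadata_major;     /* 1 for v1.x metadata */
    bool reshaping;
    bool has_external_size;
    int  raid_disks;
    int  degraded;
    bool has_write_mostly;
};

/* Check every takeover precondition; emit a distinct warning per failure. */
static int takeover_preconditions(const struct array_state *a)
{
    if (a->metadata_major != 1) {
        fprintf(stderr, "takeover: requires v1.x metadata\n");
        return -1;
    }
    if (a->reshaping) {
        fprintf(stderr, "takeover: reshape in flight\n");
        return -1;
    }
    if (a->has_external_size) {
        fprintf(stderr, "takeover: external_size set\n");
        return -1;
    }
    if (a->raid_disks < 2) {
        fprintf(stderr, "takeover: fewer than two disks\n");
        return -1;
    }
    if (a->degraded) {
        fprintf(stderr, "takeover: array degraded\n");
        return -1;
    }
    if (a->has_write_mostly) {
        fprintf(stderr, "takeover: WriteMostly member present\n");
        return -1;
    }
    return 0;
}
```

The point of per-check messages is operability: a rejected takeover names the exact precondition that fired, so operators don't need a debugger to find it.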

Effect on MeshStor. Enables in-place expansion from replicaCount=N, stripeWidth=1 (RAID1) to replicaCount=N, stripeWidth>1 (RAID10) without rebuilding the array. This is the kernel-side dependency for the RAID1→RAID10 reshape feature listed under Project Status → Available to paid customers.

Per-bucket resync barriers in RAID10 (per-bucket-arrays)

Linux branch: per-bucket-arrays
Files: drivers/md/raid10.c, drivers/md/raid1-10.c

Replaces RAID10's single global barrier / nr_pending / nr_waiting / nr_queued scalars with arrays of BARRIER_BUCKETS_NR atomic_t elements (one bucket per 64 MiB region, sector-hashed via sector_to_idx()). RAID1 already uses per-bucket barriers; this change ports the same mechanism to RAID10. The seqlock-based fast path is preserved. freeze_array / unfreeze_array continue to drain the whole array via a separate array_freeze_pending flag; raid10_quiesce uses new raise_barrier_all / lower_barrier_all helpers.
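The bucket indexing can be modeled as follows: a sector is reduced to its 64 MiB barrier unit (2^17 512-byte sectors) and the unit number is hashed into a bucket. The constants below assume the values the existing RAID1 code derives on 4 KiB pages, and the multiplicative (Fibonacci) hash stands in for the kernel's hash_long(); treat both as illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define BARRIER_UNIT_SECTOR_BITS 17   /* 2^17 sectors * 512 B = 64 MiB */
#define BARRIER_BUCKETS_NR_BITS  10   /* assumed: PAGE_SHIFT - ilog2(sizeof(atomic_t)) */
#define BARRIER_BUCKETS_NR       (1 << BARRIER_BUCKETS_NR_BITS)

/* Map a sector to its barrier bucket: all sectors in one 64 MiB region
 * share a bucket, so resync of that region only contends with I/O there. */
static unsigned sector_to_idx(uint64_t sector)
{
    uint64_t unit = sector >> BARRIER_UNIT_SECTOR_BITS;
    /* Fibonacci hashing, standing in for hash_long(). */
    return (unsigned)((unit * 0x9E3779B97F4A7C15ULL)
                      >> (64 - BARRIER_BUCKETS_NR_BITS));
}
```

Any two sectors inside the same 64 MiB region hash to the same bucket, so a barrier raised there leaves the other buckets — and the I/O hitting them — untouched.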

Effect on MeshStor. Resync of one 64 MiB region only blocks application I/O hitting the same region, instead of stalling the entire array. On RAID10 volumes that are degraded and resyncing — common during member replacement and drain migration — foreground I/O continues unimpeded for sectors outside the active resync window. Without this patch, a single in-flight resync stalls every concurrent application read or write to the array.

What's Next

  • Compatibility — supported distributions and stock kernel versions
  • Architecture — where the EWMA-driven read selection shows up in the data path
  • Project Status — roadmap items that depend on these patches