Comparison

LLM-generated draft — not proofread

This page was drafted by an LLM and has not been reviewed by a human. Treat every claim as unverified until a maintainer signs off.

Each of the storage projects below is a good fit for a different problem. This page describes how MeshStor differs technically — pick the option that matches your constraints, not the one with the longest table.

The page is in two sections:

Replicated block storage — comparison with Longhorn, OpenEBS Mayastor, Rook-Ceph, and LINSTOR/Piraeus.
Local storage CSIs — comparison with TopoLVM, OpenEBS LocalPV-LVM, and local-path-provisioner.

Replicated block storage

Compared: Longhorn, OpenEBS Mayastor, Rook-Ceph, LINSTOR/Piraeus.

Architectural axis

Axis	Longhorn	OpenEBS Mayastor	Rook-Ceph	LINSTOR / Piraeus	MeshStor
Data path location	Userspace (longhorn engine)	Userspace (SPDK)	Kernel (RBD client)	Kernel (DRBD module)	Kernel (NVMe-oF + MD)
Replication mechanism	Custom longhorn engine	Custom SPDK NVMe-oF	CRUSH + RBD	DRBD	Linux MD RAID
Hardware constraints	None unusual	HugePages, kernel ≥ 5.15, dedicated raw devices, 2 dedicated cores per IO engine	Dedicated 10GbE+ strongly recommended	DRBD kernel module	NVMe drive
License	Apache 2.0	Apache 2.0	Apache 2.0	GPL (DRBD/LINSTOR), Apache 2.0 (Piraeus)	Apache 2.0
File / object support	Block + RWX via NFS shim	Block only	Block + File (CephFS) + Object (RGW)	Block only	Block only
Snapshots / clones today	Yes (with S3 backup)	Yes	Yes	Yes	Planned (with S3 backup)

Longhorn

Longhorn is a turnkey replicated block storage solution with a built-in web UI and automated backup to S3. It uses a custom userspace longhorn engine to handle replication.

Use Longhorn when you need a turnkey storage UI and S3 backup baked into the storage layer rather than added as a separate tool.

MeshStor differs by keeping the entire data path in the kernel, with no userspace longhorn engine. Documented Longhorn failure modes related to its userspace engine (such as read-only filesystem bugs reported under sustained load) do not apply to MeshStor's MD RAID + NVMe-oF data path. MeshStor does not yet provide a built-in UI or built-in S3 backup.

OpenEBS Mayastor

Mayastor is the closest architectural sibling to MeshStor — both are NVMe-oF based replicated block storage. The mechanism is different: Mayastor uses SPDK in userspace with two dedicated cores per IO engine running pollers regardless of load, requires HugePages, requires Linux kernel 5.15 or newer, and requires dedicated raw block devices per DiskPool (a DiskPool cannot span multiple devices).

Use Mayastor when you can dedicate cores to storage IO and you want the most mature NVMe-oF userspace stack available today, including snapshots and clones.

MeshStor differs by using the in-kernel NVMe-oF target and Linux MD RAID. There is no core pinning, no HugePages requirement, no fixed CPU consumption from poller threads, and GPT partitions take the place of dedicated raw devices; MeshStor can even carve a partition out of unallocated space on the OS drive. MeshStor does not currently provide snapshots or clones. Both are on the roadmap; see Project Status.

Rook-Ceph

Rook-Ceph is the kitchen-sink storage option. A single Rook-Ceph cluster provides block (RBD), file (CephFS), and S3-compatible object (RGW) storage. It supports erasure coding for capacity efficiency, asynchronous RBD mirroring for disaster recovery, encryption at rest, metro stretch clusters, and a long list of other capabilities.

The cost is operational complexity. Rook-Ceph requires understanding Ceph fundamentals — CRUSH maps, placement groups, OSD lifecycle, monitor quorum. It runs multiple daemon types (MON, MGR, OSD, and optionally MDS for CephFS and RGW for object). Dedicated 10GbE or faster networking is strongly recommended. Running databases on Ceph has a reputation among DBAs, and not a happy one.

Use Rook-Ceph when you need multi-protocol storage (block + file + object), when you need erasure coding for cost-efficient capacity, when you need cross-site disaster recovery, or when you already standardize on Ceph elsewhere in your infrastructure.

MeshStor differs by being purpose-built for replicated block only. It is designed for database workloads. There is a single binary, no separate storage cluster, no CRUSH maps, no PG tuning. MeshStor is the right answer when you want the operational simplicity of "another Kubernetes Deployment" rather than "a distributed storage system that happens to run on Kubernetes". It is the wrong answer when you need the multi-protocol surface or the depth of features Ceph provides.

LINSTOR / Piraeus

LINSTOR is the closest functional sibling to MeshStor: both perform replication in the kernel. LINSTOR is built on DRBD, which has 20+ years of production use in Linux HA clusters. Piraeus is the Apache 2.0 Kubernetes operator that wraps LINSTOR. The combination supports multiple storage backends (LVM, LVM Thin, ZFS, ZFS Thin), three replication modes (synchronous, asynchronous, semi-synchronous for WAN DR), snapshots and clones available today, LUKS encryption, and TLS for all replication traffic. GigaOm rated LINBIT a Leader in the 2024 Kubernetes Data Storage Radar.

Use LINSTOR when you need WAN-distance asynchronous replication for cross-site disaster recovery, when you need ZFS-backed storage pools, or when you need snapshots, clones, or encryption that work today.

MeshStor differs by using the in-kernel MD RAID subsystem rather than DRBD. MD RAID has no out-of-tree kernel module dependency — it is part of the mainline Linux kernel everywhere. MeshStor uses GPT partitions on raw NVMe drives directly, which is a simpler hardware setup than configuring an LVM volume group or ZFS pool. The entire MeshStor stack is Apache 2.0; LINSTOR's underlying DRBD and LINSTOR engine are GPL while the Piraeus operator wrapper is Apache 2.0.

The control-plane shape is different too. LINSTOR stacks a LINSTOR controller, per-node satellites, a CSI plugin, and the Piraeus operator on top of each other, with the satellites talking back to the controller. MeshStor is a single binary with no inter-node control-plane traffic — desired state lives in Kubernetes CRDs and every node reconciles against them directly. Even though both DRBD and MD RAID live in the kernel, MD RAID has a smaller per-IO overhead and MeshStor avoids the expensive in-kernel data processing on the replica side.
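The "every node reconciles against CRDs directly" shape can be illustrated with a generic level-triggered reconcile step. This is only a sketch of the general pattern, not MeshStor's code: the `VolumeSpec` type, its fields, and the `reconcile` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeSpec:
    """Hypothetical desired-state record, standing in for a CRD object."""
    name: str
    node: str  # node that should host this volume's partition


def reconcile(node: str, desired: set[VolumeSpec], actual: set[str]) -> tuple[set[str], set[str]]:
    """Compare desired state with what exists locally on one node.

    Returns (to_create, to_remove) as sets of volume names. The node only
    needs the shared desired state and its own local state; it never asks
    another node or a central storage controller what to do.
    """
    wanted_here = {spec.name for spec in desired if spec.node == node}
    return wanted_here - actual, actual - wanted_here


if __name__ == "__main__":
    desired_state = {VolumeSpec("pg-0", "node-a"), VolumeSpec("pg-1", "node-b")}
    to_create, to_remove = reconcile("node-a", desired_state, {"stale-vol"})
    print("create:", sorted(to_create), "remove:", sorted(to_remove))
```

Because each node derives its actions from the same declared state, there is no separate controller-side copy of that state that could drift from what the nodes believe.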

MeshStor is the right answer when you specifically want kernel-grade replication without satellite ↔ controller state drift, without out-of-tree kernel modules and their version-matrix hell, and without having to pin pods to the nodes where their volume replicas live.

Local storage CSIs

Compared: TopoLVM, OpenEBS LocalPV-LVM, local-path-provisioner.

This section makes the case that MeshStor with replicaCount=1 is a strict superset of pure local storage CSIs, and reserves pure local storage for narrow exception cases.

Capability table

Capability	TopoLVM	OpenEBS LocalPV-LVM	local-path-provisioner	MeshStor `replicaCount=1`
Pod can reschedule to another node (drain, eviction, OOM, taints)	No	No	No	Yes — partition relocates without data loss
Survives full node loss (disk failure, hardware death)	No	No	No	No (needs `replicaCount ≥ 2`)
Snapshots / clones	LVM-local	LVM-local	No	Cross-node (planned)
Online expansion	Yes	Yes	No	Yes
Hardware requirements	LVM volume group	LVM volume group	Any directory	NVMe drive

The honest answer on the second row matters: replicaCount=1 is not a substitute for replication. The advantage is that the pod is no longer pinned to one node — every other dimension where local storage seemed simpler is actually a tie or a MeshStor win.

If you don't need replication, use MeshStor anyway

A replicaCount=1 MeshStor volume goes through the same data path as a replicated volume but only writes to one underlying partition. You get all of the operational features of the replicated mode — pod rescheduling across nodes, partition relocation on drain, future cross-node snapshots — without paying the replicated-write multiplier.
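The "replicated-write multiplier" can be made concrete with a one-line model. The sketch below assumes mirror-style replication in which every user write lands on every replica, and it ignores bitmap and filesystem metadata writes; the function name is hypothetical.

```python
def member_write_bytes(user_bytes: int, replica_count: int) -> int:
    """Bytes written across all replicas for a given amount of user data.

    Assumption: each replica is a full mirror, so the multiplier equals
    the replica count. Metadata and bitmap writes are not counted.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return user_bytes * replica_count


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        print(replicas, member_write_bytes(4096, replicas))
```

With `replica_count=1` the multiplier is 1, which is the case this section describes.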

MeshStor's data path is categorically faster than LVM thin

Most local storage CSIs in production use LVM as the volume manager. TopoLVM uses LVM logical volumes, often thin-provisioned to support snapshots. OpenEBS LocalPV-LVM does the same. The choice between LVM-linear and LVM-thin shapes a large part of local-CSI performance, and LVM-thin's overhead is substantial even before snapshots enter the picture.

Published evidence of dm-thin overhead:

4K random reads on an 8-SSD array: Dale Stephenson posted fio results on the linux-lvm list showing thick LVM at 251,303 IOPS vs thin LVM at 146,792 IOPS, a 42% reduction on single-threaded 4K random reads with the same underlying hardware.¹
Synchronous writes on NVMe: Yarden Maymon's dm-thin fast-path patch cover letter reports 204K IOPS for O_SYNC writes on a fully-allocated dm-thin volume versus 438K IOPS without O_SYNC and 451K IOPS directly on the underlying disk. That is a drop of roughly 53–55% for synchronous writes that get deferred through the pool worker instead of taking the fast path.²
Direct evidence of the concurrency bottleneck: the Thornber/Snitzer/Patocka patch series "dm bufio, thin: improve concurrent IO performance"³ exists specifically to address the single pool worker and bio-prison spinlock in dm-thin. The patches themselves are an acknowledgement from the dm-thin maintainers that the current implementation does not scale concurrently on fast storage.
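The percentages in this list follow directly from the quoted IOPS figures. A minimal sketch of the arithmetic, using only the numbers cited above (nothing is re-measured here, and the helper name `reduction_pct` is introduced for illustration):

```python
def reduction_pct(baseline_iops: float, observed_iops: float) -> float:
    """Percentage drop of observed_iops relative to baseline_iops."""
    return (1.0 - observed_iops / baseline_iops) * 100.0


# Figures as quoted in the cited mailing-list posts.
thin_vs_thick_reads = reduction_pct(251_303, 146_792)  # thick LVM vs thin LVM
osync_vs_no_osync = reduction_pct(438_000, 204_000)    # dm-thin, O_SYNC vs no O_SYNC
osync_vs_raw_disk = reduction_pct(451_000, 204_000)    # dm-thin O_SYNC vs raw disk

if __name__ == "__main__":
    print(f"thin vs thick 4K random reads: {thin_vs_thick_reads:.1f}% fewer IOPS")
    print(f"O_SYNC vs no O_SYNC on dm-thin: {osync_vs_no_osync:.1f}% fewer IOPS")
    print(f"O_SYNC on dm-thin vs raw disk:  {osync_vs_raw_disk:.1f}% fewer IOPS")
```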

The root causes are architectural: dm-thin defers every write to a single pool worker thread, double-locks each bio in the bio-prison under a global spinlock for virtual and physical key tracking, and performs btree metadata lookups on the IO path. On NVMe (20–70 µs device latency), that software overhead is a visible fraction of total IO time.

MeshStor's replicaCount=1 data path is XFS → MD (single member) → GPT partition → NVMe drive. With bitmap=internal enabled — the MeshStor default — the MD layer adds a few function-pointer indirections plus write-intent bitmap bookkeeping. On SSD/NVMe the remaining overhead is primarily bitmap lock contention rather than disk seek penalty, and recent kernel work by Yu Kuai to refactor the MD bitmap⁴ is aimed at exactly this — the motivation cited in the patch series is "lock contention and huge IO performance degradation for all raid levels." The bottleneck is under active reduction with each kernel release.

The net picture. LVM-thin-based local CSIs carry a measurable IOPS tax on small random operations and an additional penalty for every synchronous write, driven by architectural choices the dm-thin maintainers themselves are working to fix. MeshStor's replicaCount=1 data path avoids dm-thin entirely. On OLTP or fsync-heavy workloads, MeshStor's data path is categorically faster.

Even for app-replicated databases, `replicaCount=1` beats pure local storage

Picture a 3-node PostgreSQL Patroni cluster. One of the nodes hits memory pressure and Kubernetes evicts the database pod. The node itself is still alive — its disks are fine, kubelet is healthy — the only problem is that the database pod can't run there anymore.

With pure local storage CSIs (TopoLVM, OpenEBS LocalPV-LVM, local-path-provisioner): the PVC is topology-pinned to the original node. The pod cannot reschedule anywhere else. Either it stays Pending until the original node has free memory again, or the operator deletes the PVC and forces Patroni into a full pg_basebackup onto a fresh local volume on another node. During that basebackup, the rebuilding member serves no queries, and one of the two surviving members is tied up serving the backup — leaving the cluster running on a single full-speed replica until the rebuild finishes. Soft eviction has triggered the same recovery cost as a disk failure would have.

With MeshStor replicaCount=1: the pod reschedules onto another node. MeshStor mounts the original partition remotely over NVMe-oF on the new node, and the pod resumes immediately reading the original data at slightly degraded latency. The relocation moves the partition to the new node in the background. The two surviving members stay at full speed throughout; the relocated member is mildly degraded only during the relocation window. No DB-level rebuild is needed.

Scoping note. This scenario assumes the original node is alive but unable to host the pod (drain, memory pressure, OOM, taint). If the node is fully lost (disk failure, hardware death), replicaCount=1 loses data exactly like pure local storage. To survive full node loss, use replicaCount ≥ 2.
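The scenario and its scoping note reduce to a small decision table. The sketch below encodes only the outcomes described in this section; the `Outcome` enum and `volume_outcome` function are hypothetical names, not part of any MeshStor API.

```python
from enum import Enum


class Outcome(Enum):
    RESUMES_ON_OTHER_NODE = "pod reschedules; data intact"
    PENDING_OR_DB_REBUILD = "pod stays Pending, or the database rebuilds from a peer"
    DATA_LOST = "volume data is lost"


def volume_outcome(meshstor: bool, replica_count: int, node_alive: bool) -> Outcome:
    """Outcome for a pod whose original node can no longer host it.

    node_alive=True  -> drain, memory pressure, OOM, or taint (disks are fine).
    node_alive=False -> full node loss (disk failure, hardware death).
    replica_count is ignored for pure local storage CSIs.
    """
    if node_alive:
        if meshstor:
            return Outcome.RESUMES_ON_OTHER_NODE
        return Outcome.PENDING_OR_DB_REBUILD
    if meshstor and replica_count >= 2:
        return Outcome.RESUMES_ON_OTHER_NODE
    return Outcome.DATA_LOST


if __name__ == "__main__":
    print(volume_outcome(meshstor=True, replica_count=1, node_alive=True).value)
    print(volume_outcome(meshstor=False, replica_count=1, node_alive=True).value)
    print(volume_outcome(meshstor=True, replica_count=1, node_alive=False).value)
```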

Snapshot cost model

Beyond steady-state throughput, LVM thin pays the snapshot tax continuously. dm-thin carries metadata-lookup overhead even when no snapshot has been taken yet, because the volume still uses the dm-thin metadata layer. Once a snapshot exists, every first-touch write to a shared chunk triggers a full copy-on-write cycle: read the entire shared chunk, write it to a new location, update the btree mapping, then submit the original user write.

The write amplification is brutal. Counting the chunk read, the chunk write, and the user write itself, a 4 KiB user write against a 64 KiB shared chunk (LVM thin's default) is amplified ~33×. At 512 KiB chunks the same accounting gives ~257×. On SSD each first-touch COW adds 100–500 µs per chunk; on HDD, seeks compound and each COW costs 15–25 ms.
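The amplification figures can be reproduced with a simple accounting model. The sketch below assumes the copy-on-write cycle described above (read the whole chunk, write it to a new location, then submit the user write) and ignores btree metadata IO; the function name is hypothetical.

```python
def cow_write_amplification(chunk_kib: int, user_write_kib: int) -> float:
    """Device IO per unit of user write for a first-touch write to a shared chunk.

    Assumed model: one full-chunk read + one full-chunk write + the user
    write itself. Metadata updates are not counted.
    """
    total_io_kib = chunk_kib + chunk_kib + user_write_kib
    return total_io_kib / user_write_kib


if __name__ == "__main__":
    for chunk in (64, 512):
        factor = cow_write_amplification(chunk, 4)
        print(f"{chunk} KiB chunk, 4 KiB write: {factor:.0f}x")
```

Under this model the two cases come out at 33× and 257×. The factor falls as the user write size approaches the chunk size.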

MeshStor's planned snapshot mechanism pays its performance cost at snapshot creation, not at write time. The implementation detaches one RAID member as the consistent point and resyncs a replacement member; see Project Status for the full procedure in both the replicaCount=1 and replicaCount≥2 cases. During the resync window, foreground writes pay a straightforward 2× amplification and contend moderately with the resync traffic. Once the resync completes, normal writes return to zero snapshot-related per-write overhead.

For evaluators with write-heavy workloads who plan to use snapshots, the difference is paying a 33–257× tax on every first-touch write for as long as the snapshot exists versus paying a lump-sum resync cost once. The comparison on total bytes written depends on the workload: a resync copies the whole member once, whereas dm-thin copies only the chunks that are written after the snapshot.

Exception: local-path-provisioner uses raw hostPath directories with no volume manager at all. That is the fastest possible local storage path — but it has none of the operational features (no expansion, no snapshots, no quota enforcement, no relocation) and no data protection of any kind.

So when is pure local storage still the right choice?

A short, narrow list:

You cannot run the MeshStor controller and DaemonSet at all, for example on extremely resource-constrained edge nodes where even the single-binary footprint is unacceptable.
You are already running TopoLVM or OpenEBS LocalPV-LVM in production and the migration cost outweighs the operational benefit of switching.
You need local-path-provisioner for ephemeral CI runners and dev clusters where you genuinely don't need any data services. That is a fine use of local-path-provisioner and MeshStor would be overkill.

Outside of these cases, replicaCount=1 MeshStor is the better default for unreplicated workloads.

What's Next

Use Cases — gut-check on whether MeshStor fits your situation
Project Status — Technical Preview maturity statement
Performance — architectural overhead breakdown
Installation — deploy MeshStor and create your first volume

¹ Dale Stephenson, [linux-lvm] Performance penalty for 4k requests on thin provisioned volume, linux-lvm mailing list, 2017-09-14.
² Yarden Maymon, [PATCH] dm-thin: handle fast-path O_SYNC IO, dm-devel mailing list, 2023-10-30.
³ Joe Thornber, Mike Snitzer, Mikulas Patocka, dm bufio, thin: improve concurrent IO performance, LWN.net, 2023.
⁴ Yu Kuai, md/md-bitmap: introduce bitmap_operations and make structure internal, LWN.net, 2024-08-26.