Comparison¶
Each of the storage projects below is a good fit for a different problem. This page describes how MeshStor differs technically — pick the option that matches your constraints, not the one with the longest table.
Third-party benchmark numbers on this page are cited inline. For the maturity context, see Project Status.
The page is in two sections:
- Replicated block storage — comparison with Longhorn, OpenEBS Mayastor, Rook-Ceph, and LINSTOR/Piraeus.
- Local storage CSIs — comparison with TopoLVM, OpenEBS LocalPV-LVM, and local-path-provisioner. Includes the case for using MeshStor with numberOfCopies=1 instead of pure local storage.
Replicated block storage¶
Compared: Longhorn, OpenEBS Mayastor, Rook-Ceph, LINSTOR/Piraeus.
Architectural axis¶
| Axis | Longhorn | OpenEBS Mayastor | Rook-Ceph | LINSTOR / Piraeus | MeshStor |
|---|---|---|---|---|---|
| Data path location | Userspace (longhorn engine) | Userspace (SPDK) | Kernel (RBD client) | Kernel (DRBD module) | Kernel (NVMe-oF + MD) |
| Replication mechanism | Custom longhorn engine | Custom SPDK NVMe-oF | CRUSH + RBD | DRBD | Linux MD RAID |
| Daemon types | manager + engine + replica | io-engine + control + diskpool | MON + MGR + OSD + (MDS, RGW) | controller + satellite | controller + node |
| Hardware constraints | None unusual | HugePages, kernel ≥ 5.15, dedicated raw devices, 2 dedicated cores per IO engine | Dedicated 10GbE+ strongly recommended | DRBD kernel module | NVMe drive |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | GPL (DRBD/LINSTOR), Apache 2.0 (Piraeus) | Apache 2.0 |
| File / object support | Block + RWX via NFS shim | Block only | Block + File (CephFS) + Object (RGW) | Block only | Block only |
| Snapshots / clones today | Yes (with S3 backup) | Yes | Yes | Yes | Planned (open source) |
| Published reference IOPS | ~19K (NVMe/10GbE) | ~28K (NVMe/10GbE) | ~32K (NVMe/10GbE) | "near-native" | Not yet published |
Longhorn¶
Longhorn is a turnkey replicated block storage solution with a built-in web UI and automated backup to S3. It uses a custom userspace longhorn engine to handle replication. Published third-party benchmarks report ~19K IOPS on NVMe with 10GbE networking.
Use Longhorn when you need a turnkey storage UI and S3 backup baked into the storage layer rather than added as a separate tool.
MeshStor differs by keeping the entire data path in the kernel — no userspace longhorn engine. Documented Longhorn failure modes related to its userspace engine (such as read-only filesystem bugs reported under sustained load) do not apply to MeshStor's MD RAID + NVMe-oF data path. MeshStor does not currently provide a built-in UI or built-in S3 backup; if those are core requirements, Longhorn fits better today.
OpenEBS Mayastor¶
Mayastor is the closest architectural sibling to MeshStor — both are NVMe-oF based replicated block storage. The mechanism is different: Mayastor uses SPDK in userspace with two dedicated cores per IO engine running pollers regardless of load, requires HugePages, requires Linux kernel 5.15 or newer, and requires dedicated raw block devices per DiskPool (a DiskPool cannot span multiple devices). Published third-party benchmarks report ~28K IOPS on NVMe with 10GbE networking.
Use Mayastor when you can dedicate cores to storage IO and you want the most mature NVMe-oF userspace stack available today, including snapshots and clones.
MeshStor differs by using the in-kernel NVMe-oF target and Linux MD RAID. There is no core pinning, no HugePages requirement, GPT partitions are used in place of dedicated raw devices, and there is no fixed CPU consumption from poller threads. MeshStor does not currently provide snapshots or clones — those are planned for the open-source roadmap, see Project Status.
Rook-Ceph¶
Rook-Ceph is the kitchen-sink storage option. A single Rook-Ceph cluster provides block (RBD), file (CephFS), and S3-compatible object (RGW) storage. It supports erasure coding for capacity efficiency, asynchronous RBD mirroring for disaster recovery, encryption at rest, metro stretch clusters, and a long list of other capabilities. Published third-party benchmarks report ~32K IOPS on NVMe with 10GbE networking — the highest of any open-source option in this comparison.
The cost is operational complexity. Rook-Ceph requires understanding Ceph fundamentals — CRUSH maps, placement groups, OSD lifecycle, monitor quorum. It runs multiple daemon types (MON, MGR, OSD, and optionally MDS for CephFS and RGW for object). Dedicated 10GbE or faster networking is strongly recommended.
Use Rook-Ceph when you need multi-protocol storage (block + file + object), when you need erasure coding for cost-efficient capacity, when you need cross-site disaster recovery, or when you already standardize on Ceph elsewhere in your infrastructure.
MeshStor differs by being purpose-built for replicated block only. There is a single binary, no separate storage cluster, no CRUSH maps, no PG tuning. MeshStor is the right answer when you want the operational simplicity of "another Kubernetes Deployment" rather than "a distributed storage system that happens to run on Kubernetes". It is the wrong answer when you need the multi-protocol surface or the depth of features Ceph provides.
LINSTOR / Piraeus¶
LINSTOR is the closest functional sibling to MeshStor — both perform replication in the kernel. LINSTOR is built on DRBD, which has 20+ years of in-tree provenance in Linux HA clusters. Piraeus is the Apache 2.0 Kubernetes operator that wraps LINSTOR. The combination supports multiple storage backends (LVM, LVM Thin, ZFS, ZFS Thin), three replication modes (synchronous, asynchronous, semi-synchronous for WAN DR), snapshots and clones available today, LUKS encryption, and TLS for all replication traffic. GigaOm rated LINBIT a Leader in the 2024 Kubernetes Data Storage Radar.
Use LINSTOR when you need WAN-distance asynchronous replication for cross-site disaster recovery, when you need ZFS-backed storage pools, or when you need snapshots, clones, or encryption that work today.
MeshStor differs by using the in-kernel MD RAID subsystem rather than DRBD. MD RAID has no out-of-tree kernel module dependency — it is part of the mainline Linux kernel everywhere. MeshStor uses GPT partitions on raw NVMe drives directly, which is a simpler hardware setup than configuring an LVM volume group or ZFS pool. The entire MeshStor stack is Apache 2.0; LINSTOR's underlying DRBD and LINSTOR engine are GPL while the Piraeus operator wrapper is Apache 2.0. MeshStor uses a Kubernetes-native CRD model (MeshStorVolume, MeshStorNodeDevice) rather than wrapping a separate satellite/controller deployment.
LINSTOR is mature; MeshStor is in Technical Preview. If you need production support today and the LINSTOR feature set covers your needs, LINSTOR is the lower-risk option.
MeshStor is the right answer when you specifically want kernel-grade replication without operating a separate storage cluster, you accept Technical Preview maturity, and the operational simplicity of a single binary outweighs the maturity gap.
Local storage CSIs¶
Compared: TopoLVM, OpenEBS LocalPV-LVM, local-path-provisioner.
This section makes the case that MeshStor with numberOfCopies=1 is a strict superset of pure local storage CSIs, and reserves pure local storage for narrow exception cases.
Capability table¶
| Capability | TopoLVM | OpenEBS LocalPV-LVM | local-path-provisioner | MeshStor numberOfCopies=1 |
|---|---|---|---|---|
| Pod can reschedule to another node (drain, eviction, OOM, taints) | No — PVC is pinned to the node | No | No | Yes — partition relocates |
| Survives full node loss (disk failure, hardware death) | No | No | No | No (needs numberOfCopies ≥ 2) |
| Snapshots / clones | LVM-local | LVM-local | No | Cross-node (planned, open source) |
| Online expansion | Yes (LVM extend) | Yes | No | Yes |
| Hardware requirements | LVM volume group | LVM volume group | Any directory | NVMe drive |
The honest answer on the second row matters: numberOfCopies=1 is not a substitute for replication. The advantage is that the pod is no longer pinned to one node — every other dimension where local storage seemed simpler is actually a tie or a MeshStor win.
If you don't need replication, use MeshStor anyway¶
A numberOfCopies=1 MeshStor volume goes through the same data path as a replicated volume but only writes to one underlying partition. You get all of the operational features of the replicated mode — pod rescheduling across nodes, partition relocation on drain, future cross-node snapshots — without paying the replicated-write multiplier.
The cost in steady state is the MeshStor data-path overhead (XFS → MD single member → GPT partition → NVMe drive) compared to a pure local volume. Whether that overhead matters depends on the workload; for most workloads it does not — see Performance for the overhead breakdown.
Even for app-replicated databases, numberOfCopies=1 beats pure local storage¶
Picture a 3-node PostgreSQL Patroni cluster. One of the nodes hits memory pressure and Kubernetes evicts the database pod. The node itself is still alive — its disks are fine, kubelet is healthy — the only problem is that the database pod can't run there anymore.
With pure local storage CSIs (TopoLVM, OpenEBS LocalPV-LVM, local-path-provisioner): the PVC is topology-pinned to the original node. The pod cannot reschedule anywhere else. Either it stays Pending until the original node has free memory again, or the operator deletes the PVC and forces Patroni into a full pg_basebackup onto a fresh local volume on another node. During that basebackup, the rebuilding member serves no queries, and one of the two surviving members is tied up serving the backup — leaving the cluster running on a single full-speed replica until the rebuild finishes. Soft eviction has triggered the same recovery cost as a disk failure would have.
With MeshStor numberOfCopies=1: the pod reschedules onto another node. MeshStor mounts the original partition remotely over NVMe-oF on the new node, and the pod resumes immediately reading the original data at slightly degraded latency. The relocation moves the partition to the new node in the background. The two surviving members stay at full speed throughout; the relocated member is mildly degraded only during the relocation window. No DB-level rebuild is needed.
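The asymmetry between the two recovery paths can be sketched with back-of-envelope arithmetic. The database size and effective network bandwidth below are made-up inputs for illustration; only the shape of the comparison matters:

```python
# Rough cost model for the eviction scenario above.
# Assumed inputs (not measured figures): a 200 GiB database and
# ~1 GiB/s of effective network bandwidth (roughly 10 GbE).

GIB = 1024**3
db_size = 200 * GIB
net_bw = 1.0 * GIB  # bytes per second

# Pure local storage: the rebuilding member serves no queries until
# the full pg_basebackup onto a fresh local volume completes.
rebuild_blackout_s = db_size / net_bw

# MeshStor numberOfCopies=1: the pod resumes immediately over
# NVMe-oF; the same bytes move in the background relocation instead,
# during which the member is only mildly degraded.
serve_blackout_s = 0.0
relocation_background_s = db_size / net_bw

assert rebuild_blackout_s == 200.0   # ~3.3 minutes with zero service
assert serve_blackout_s < rebuild_blackout_s
```

The total bytes moved are similar in both cases; the difference is whether the member is offline while they move.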
Scoping note. This scenario assumes the original node is alive but unable to host the pod (drain, memory pressure, OOM, taint). If the node is fully lost (disk failure, hardware death), numberOfCopies=1 loses data exactly like pure local storage. To survive full node loss, use numberOfCopies ≥ 2.
MeshStor's data path is categorically faster than LVM thin¶
Most local storage CSIs in production use LVM as the volume manager. TopoLVM uses LVM logical volumes, often thin-provisioned to support snapshots. OpenEBS LocalPV-LVM does the same. The choice between LVM-linear and LVM-thin shapes a large part of local-CSI performance, and LVM-thin's overhead is substantial even before snapshots enter the picture.
Published third-party measurements of dm-thin overhead:
- Small random writes on EBS gp3 (LINBIT/LINSTOR benchmark, 2024): thick LVM 3,093 IOPS vs thin LVM 1,650 IOPS — a 46.7% reduction.
- 4K random reads on an 8-SSD RAID0 array (Dale Stephenson, linux-lvm, 2017): thick/linear 251,303 IOPS vs thin 146,792 IOPS — a 41.6% reduction.
- fsync-heavy writes (Yarden Maymon): dm-thin with O_SYNC showing ~50% IOPS degradation (438K → 204K, vs 451K on raw disk). Every fsync() on dm-thin triggers two full device flushes — data device and metadata device — instead of one.
- Ramdisk-class IO rates (Joe Thornber, dm-devel, 2011): pure software overhead visible as a ~38% throughput reduction (fully-allocated thin 5,742 MB/s vs linear 9,258 MB/s) — the dm-thin tax with storage latency eliminated.
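The reduction percentages follow directly from the cited IOPS and throughput pairs; a quick sanity check:

```python
# Restating the cited benchmark pairs as (thick - thin) / thick.
# All input figures are taken from the benchmarks quoted above.

def reduction_pct(thick, thin):
    """Fractional loss going from thick/linear to thin, as a percent."""
    return round((thick - thin) / thick * 100, 1)

# LINBIT EBS gp3 benchmark (2024): small random writes
assert reduction_pct(3093, 1650) == 46.7

# linux-lvm 8-SSD RAID0 (2017): 4K random reads
assert reduction_pct(251303, 146792) == 41.6

# dm-devel ramdisk-class run (2011): throughput in MB/s
assert reduction_pct(9258, 5742) == 38.0
```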
The root causes are architectural: dm-thin defers every write to a single pool worker thread, double-locks each bio in the bio-prison under a global spinlock for virtual and physical key tracking, and performs btree metadata lookups on the IO path. On NVMe (20–70 µs device latency), dm-thin's software overhead consumes 10–30% of total IO time, making the tax acutely visible.
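A rough latency-budget model shows where a 10–30% share comes from. The ~10 µs per-IO software cost below is an illustrative assumption, not a measured dm-thin figure; the 20–70 µs device range is from the text above:

```python
# Share of total IO time consumed by a fixed software cost on NVMe.
# sw_us = assumed per-IO software overhead (illustrative, ~10 us);
# dev_us = device latency from the 20-70 us NVMe range above.

def software_share_pct(sw_us, dev_us):
    """Percent of total IO time spent in software, not the device."""
    return round(sw_us / (sw_us + dev_us) * 100)

fast_nvme = software_share_pct(10, 20)  # low-latency device: tax largest
slow_nvme = software_share_pct(10, 70)  # slower device: tax shrinks

assert fast_nvme == 33   # ~a third of the IO budget
assert slow_nvme == 12   # ~an eighth of the IO budget
```

The same fixed cost that was invisible behind a 10 ms HDD seek becomes a double-digit share of every NVMe IO.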
MeshStor's numberOfCopies=1 data path is XFS → MD (single member) → GPT partition → NVMe drive. With bitmap=internal enabled — the MeshStor default — the MD layer adds a few function-pointer indirections plus write-intent bitmap bookkeeping. On SSD/NVMe the remaining overhead is primarily bitmap lock contention rather than disk seek penalty, and recent kernel work on bitmap locking (covered by LWN in 2024; Yu Kuai) improved RAID5-on-PCIe-4.0-SSD throughput by 89.4% and cut p99.99 latency by 85% in the contended case — so the overhead is trending down with each kernel release.
The net picture. MeshStor's RAID1 single-leg with internal bitmap lands in the single-digit-percent overhead range on NVMe. LVM-thin-based local CSIs pay a 40–50% IOPS tax on small random writes and roughly 2× flush cost on every fsync. On any OLTP or fsync-heavy workload, MeshStor's data path is categorically faster.
Snapshot cost model¶
Beyond steady-state throughput, LVM thin pays the snapshot tax continuously. dm-thin carries metadata-lookup overhead even when no snapshot has been taken yet, because the volume still uses the dm-thin metadata layer. Once a snapshot exists, every first-touch write to a shared chunk triggers a full copy-on-write cycle: read the entire shared chunk, write it to a new location, update the btree mapping, then submit the original user write.
The write amplification is brutal. For a 4 KiB user write against a 64 KiB shared chunk (LVM thin's default), the amplification is ~33×. At 512 KiB chunks it reaches ~256×. On SSD each first-touch COW adds 100–500 µs per chunk; on HDD, seeks compound and each COW costs 15–25 ms.
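The amplification figures follow directly from the chunk geometry — read the whole shared chunk, write the copy, then submit the user write:

```python
# First-touch COW amplification on a shared dm-thin chunk.
# Bytes moved = read whole chunk + write copy + submit user write.
# Sizes in KiB; metadata (btree) IO is ignored for simplicity.

def cow_amplification(write_kib, chunk_kib):
    bytes_moved = chunk_kib + chunk_kib + write_kib
    return bytes_moved / write_kib

assert cow_amplification(4, 64) == 33.0    # ~33x at the 64 KiB default
assert cow_amplification(4, 512) == 257.0  # ~256x at 512 KiB chunks
```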
MeshStor's planned snapshot mechanism pays its cost at snapshot creation, not at write time. The implementation temporarily adds one extra disk into the RAID array, takes a sub-second xfs_freeze for a consistent point, and removes the temporary member. During the resync window, foreground writes pay a straightforward 2× amplification and contend moderately with the resync traffic. Once the snapshot is detached, normal writes return to zero per-write overhead.
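The lump-sum shape of that cost can be sketched with simple arithmetic. The 1 GiB/s resync rate below is an assumed figure for a local NVMe pair, not a MeshStor specification:

```python
# Sizing the resync window in the planned snapshot model: adding a
# temporary RAID member means copying the whole volume exactly once.
# Assumed inputs: 100 GiB volume, ~1 GiB/s resync rate.

GIB = 1024**3

def resync_window_s(volume_bytes, resync_rate_bytes_per_s):
    """Seconds during which foreground writes pay the 2x amplification."""
    return volume_bytes / resync_rate_bytes_per_s

window = resync_window_s(100 * GIB, 1 * GIB)
assert window == 100.0  # ~100 s of 2x writes, then zero per-write overhead
```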
For evaluators with write-heavy workloads who plan to use snapshots, the difference is paying an amortized 33–256× tax on every first-touch write forever versus paying a lump-sum resync cost once. For workloads that touch nearly every chunk after a snapshot, MeshStor also comes out ahead on total bytes moved.
Exception: local-path-provisioner uses raw hostPath directories with no volume manager at all. That is the fastest possible local storage path — but it has none of the operational features (no expansion, no snapshots, no quota enforcement, no relocation) and no data protection of any kind.
So when is pure local storage still the right choice?¶
A short, narrow list:
- You cannot run the MeshStor controller and DaemonSet at all. Extremely resource-constrained edge nodes where the control-plane footprint is unacceptable.
- You are already running TopoLVM or OpenEBS LocalPV-LVM in production and the migration cost outweighs the operational benefit of switching.
- You need local-path-provisioner for ephemeral CI runners and dev clusters where you genuinely don't need any data services. That is a fine use of local-path-provisioner and MeshStor would be overkill.
Outside of these cases, numberOfCopies=1 MeshStor is the better default for unreplicated workloads.
What's Next¶
- Overview — the three differentiators that define MeshStor
- Use Cases — gut-check on whether MeshStor fits your situation
- Project Status — Technical Preview maturity statement
- Performance — architectural overhead breakdown
- Installation — deploy MeshStor and create your first volume