Performance

This page describes the architectural performance characteristics of MeshStor — derived from the data path, not measured in production. For the maturity context and benchmark status, see Project Status. To run your own numbers in about an hour, skip to "How to get real numbers in an hour" below.


The data path in one diagram

sequenceDiagram
    participant Pod
    participant XFS
    participant MD as MD RAID
    participant Local as Local NVMe-oF target
    participant Net as Network
    participant Remote as Remote NVMe-oF host
    participant Disk as Backing partition

    Pod->>XFS: write() syscall (memcpy in)
    XFS->>MD: bio submit (zero-copy)
    MD->>Local: write to local member (zero-copy)
    MD->>Net: write to remote member (NVMe-oF framing)
    Net->>Remote: TCP / RDMA (kernel)
    Remote->>Disk: bio submit (zero-copy)
    Disk-->>Remote: completion
    Remote-->>Net: NVMe-oF completion
    Net-->>MD: ack
    Local-->>MD: ack
    MD-->>XFS: completion
    XFS-->>Pod: write() returns

The diagram shows a write to a numberOfCopies=2 volume. The latency the pod sees is max(local_latency, remote_latency), because both copies are written in parallel. For numberOfCopies=1, the diagram collapses to the local-only branches.
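
As a back-of-envelope check of the max() model, here is a sketch with assumed per-copy latencies. The numbers are illustrative, not measurements:

```shell
# Illustrative only: assumed per-copy write latencies in microseconds.
local_us=25    # local NVMe bio submit + completion
remote_us=95   # NVMe-oF/TCP round trip + remote bio + completion

# Both copies are issued in parallel, so the pod-visible latency
# is the slower of the two, not their sum.
pod_us=$(( local_us > remote_us ? local_us : remote_us ))
echo "pod-visible write latency: ${pod_us}us"
```

With these assumptions the remote copy dominates; shaving the remote path (e.g. RDMA instead of TCP) moves the pod-visible number, while a faster local disk does not.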


Where MeshStor should be fast

  • The data path is in the kernel from end to end. No userspace IO engine, no SPDK pollers, no proxy daemon. Each hop is either a kernel function call or a kernel network send.
  • No copy-out and copy-in for replication. The MD layer issues bios directly to the local block device and to the NVMe-oF host driver. There is no userspace memcpy in the write path.
  • RDMA is supported when available. When both nodes have RDMA-capable NICs and the cluster is annotated correctly, NVMe-oF uses RDMA on port 4421. RDMA latency is typically a small number of microseconds vs. tens of microseconds for TCP.
  • No per-IO-engine core pinning. Unlike user-space SPDK-based stacks, MeshStor's data path can yield to other work on the node. There is no constant 100% CPU consumption from poller threads.

Where MeshStor will be slower than you hope

  • Small random writes are gated by the slowest replica. RAID1 and RAID10 write to all members synchronously, so steady-state latency is bounded by the slowest member.
  • First-write-after-create allocates blocks. XFS allocates extents on first write, which adds metadata IO. This is true of any XFS volume, but it's worth knowing about when interpreting your first benchmark run.
  • The replication multiplier is real. A numberOfCopies=3 write puts pressure on three nodes' networks and three backing devices. Sustained throughput is bounded by min(network_per_node, disk_per_node) ÷ N.
  • Cross-rack latency dominates RDMA's advantage. RDMA reduces the per-packet latency, but if your remote copy is in another rack with 100µs of switching latency, the per-packet improvement is in the noise.
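
To put rough numbers on the replication multiplier, a sketch with assumed per-node limits (the values are illustrative, not measurements):

```shell
# Illustrative only: assumed per-node limits in MB/s.
net_per_node=1200    # usable NVMe-oF throughput per node
disk_per_node=2000   # backing NVMe partition, sequential writes
copies=3             # numberOfCopies=3

# Each client byte is written to `copies` replicas, so the sustained
# client-visible bandwidth is bounded by the tighter per-node limit
# divided by the replication factor.
tighter=$(( net_per_node < disk_per_node ? net_per_node : disk_per_node ))
bound=$(( tighter / copies ))
echo "upper bound: ${bound} MB/s per writer"
```

Under these assumptions the network, not the disks, is the binding constraint; a benchmark result well below this bound points at something other than raw replication cost.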

How to get real numbers in an hour

The fastest way to know how MeshStor performs on your hardware is to run fio on a freshly created volume. The recipe below covers the three workload patterns that matter for most evaluations.

Step 1: Create a test volume

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: meshstor-bench
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: mesh-2copy-tcp
  resources:
    requests:
      storage: 50Gi
EOF

Step 2: Run fio in a debug pod

kubectl run fio-bench --rm -it --image=ljishen/fio --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"fio-bench","image":"ljishen/fio","volumeMounts":[{"name":"v","mountPath":"/data"}],"command":["sleep","3600"]}],"volumes":[{"name":"v","persistentVolumeClaim":{"claimName":"meshstor-bench"}}]}}'

# in another terminal:
kubectl exec -it fio-bench -- fio \
  --name=randread --filename=/data/test.bin --size=10G \
  --bs=4k --rw=randread --ioengine=libaio --iodepth=32 \
  --runtime=60 --time_based --direct=1

Repeat with --rw=randwrite and --rw=write --bs=1M --iodepth=8 for the random-write and sequential-write patterns.
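
The three variants can also be expressed as a single fio jobfile, which is easier to keep consistent across runs than retyping flags. This is a sketch mirroring the options used in Step 2 (run it with `fio bench.fio` inside the pod):

```ini
; bench.fio — the three evaluation patterns from this page.
[global]
filename=/data/test.bin
size=10G
ioengine=libaio
direct=1
time_based=1
runtime=60

[randread]
bs=4k
rw=randread
iodepth=32

[randwrite]
stonewall        ; wait for the previous job before starting
bs=4k
rw=randwrite
iodepth=32

[seqwrite]
stonewall
bs=1M
rw=write
iodepth=8
```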

Step 3: What to look at

  • iops and lat (avg) for random read — this is the latency-sensitive number that matters for OLTP databases.
  • iops for random write — bounded by replication multiplier.
  • bw for sequential write — bounded by min(network_per_node, disk_per_node) ÷ N.
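
fio can emit these numbers in machine-readable form with --output-format=json, which is easier to compare across runs than eyeballing the log. A sketch of extracting the headline fields — the result.json below is a hand-made miniature of fio's JSON shape with illustrative values, not captured data:

```shell
# Hand-made miniature of fio's --output-format=json shape (illustrative values).
cat > result.json <<'EOF'
{"jobs":[{"jobname":"randread",
  "read":{"iops":52000.0,
          "lat_ns":{"mean":610000.0}}}]}
EOF

# Pull out iops and average latency; python3 avoids a jq dependency.
python3 - <<'EOF'
import json

job = json.load(open("result.json"))["jobs"][0]
print(f"{job['jobname']}: {job['read']['iops']:.0f} iops, "
      f"{job['read']['lat_ns']['mean'] / 1000:.0f} us avg lat")
EOF
```

For a real run, replace the here-document with the JSON that fio writes when invoked with --output-format=json.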

What to ignore

  • The first 5–10 seconds of any run. XFS allocation, MD initial sync, and TCP slow-start all skew early measurements.
  • --bs=512 results unless you genuinely run 512-byte IO. Below the device's logical block size you measure the kernel's block-merging behavior, not MeshStor's data path.
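
Rather than discarding the early seconds by eye, fio can do it for you: the ramp_time option runs the workload for a warm-up period before any statistics are recorded. Shown as a jobfile fragment (the equivalent command-line flag is --ramp_time=10):

```ini
[global]
ramp_time=10   ; warm up for 10s before recording any samples
time_based=1
runtime=60     ; measured window, after the ramp
```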

What's Next

  • Architecture — the layered data path this page analyzes
  • Replication — RAID-level performance characteristics
  • Comparison — head-to-head with other replicated and local CSIs