Tuning

MeshStor works out of the box. The knobs below are optional performance tuning for operators on a fast fabric (10 GbE and above). They are host-side policy that MeshStor deliberately does not touch — the operator is best placed to apply them.

Block Device Sysfs

MeshStor sets XFS mount/mkfs/mdadm options that are workload-agnostic, but the per-device sysfs knobs below are host-side policy.

# Apply to every NVMe namespace MeshStor touches AND to the MD device
# after assembly. Adjust the device list to your setup.
for d in nvme0n1 nvme1n1 md0; do
    # NVMe uses blk-mq; any scheduler adds latency
    echo none > /sys/block/$d/queue/scheduler 2>/dev/null || true
    # small random I/O; default 128 KiB wastes fabric bandwidth
    echo 16 > /sys/block/$d/queue/read_ahead_kb
    # strict-CPU completion; cuts IPI / cache bouncing
    echo 2 > /sys/block/$d/queue/rq_affinity
    # remove entropy-collection overhead
    echo 0 > /sys/block/$d/queue/add_random
    # writeback throttling; NVMe is too fast for WBT
    echo 0 > /sys/block/$d/queue/wbt_lat_usec
done
# Deep queue for OLTP concurrency; only on the NVMe components
for d in nvme0n1 nvme1n1; do
    echo 1023 > /sys/block/$d/queue/nr_requests
done
# Resync rate bounds (apply globally)
echo 50000    > /proc/sys/dev/raid/speed_limit_min
echo 1000000  > /proc/sys/dev/raid/speed_limit_max
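To confirm the knobs took effect, a small read-back loop helps. This is a convenience sketch, not MeshStor tooling; the `check` helper and the `SYSFS` override (for dry-running against a fake tree) are ours.

```shell
# Read back the queue knobs set above and report any mismatches.
# SYSFS can be overridden for a dry run; defaults to the real /sys.
SYSFS="${SYSFS:-/sys}"
check() {  # usage: check <device> <knob> <expected>
    f="$SYSFS/block/$1/queue/$2"
    [ -r "$f" ] || { echo "missing $f"; return; }
    got=$(cat "$f")
    [ "$got" = "$3" ] || echo "$1 $2: want $3 got $got"
}
for d in nvme0n1 nvme1n1 md0; do
    check "$d" read_ahead_kb 16
    check "$d" rq_affinity 2
    check "$d" add_random 0
    check "$d" wbt_lat_usec 0
done
```

Silence means every knob matches; a "missing" line usually means the device name does not match your setup.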

For RAID10 arrays, also raise group_thread_cnt to the member count; without it, a single mdX_raid10 kernel thread caps throughput at roughly 300–400K IOPS. Persist the setting via udev (the sysfs file is re-created on every array assembly):

sudo tee /etc/udev/rules.d/20-md-raid10-tuning.rules > /dev/null <<'EOF'
ACTION=="add|change", KERNEL=="md*", ATTR{md/group_thread_cnt}="4"
EOF

sudo udevadm control --reload-rules
sudo udevadm trigger

Why 4?

Match group_thread_cnt to the number of RAID10 members (4 for a 2-copies × 2-drives-per-copy volume). For RAID1 or larger RAID10 geometries, adjust accordingly.
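For a one-off runtime change, the member count can also be read from the array itself instead of hard-coded. The `set_group_threads` helper name and the `SYSFS` override below are illustrative, not MeshStor tooling.

```shell
# Copy the array's raid_disks count into group_thread_cnt at runtime.
SYSFS="${SYSFS:-/sys}"
set_group_threads() {  # usage: set_group_threads <md-device>
    md_dir="$SYSFS/block/$1/md"
    # Skip silently if the array (or write permission) is absent.
    [ -r "$md_dir/raid_disks" ] && [ -w "$md_dir/group_thread_cnt" ] || return 0
    cat "$md_dir/raid_disks" > "$md_dir/group_thread_cnt"
}
set_group_threads md0
```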

TCP Buffers

For NVMe/TCP over a fast fabric, the kernel's default TCP buffer sizes are too small. The bandwidth-delay product at 100 Gbit/s and 1 ms RTT is 12.5 MB; the recommended maxima below give headroom across many queues:

sudo tee /etc/sysctl.d/99-nvme-tcp.conf > /dev/null <<'EOF'
net.core.rmem_max           = 268435456
net.core.wmem_max           = 268435456
net.ipv4.tcp_rmem           = 4096 262144 134217728
net.ipv4.tcp_wmem           = 4096 262144 134217728
net.core.netdev_max_backlog = 300000
net.core.somaxconn          = 4096
EOF

sudo sysctl --system
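The bandwidth-delay figure above can be sanity-checked with shell arithmetic (100 Gbit/s and 1 ms RTT are the assumed fabric parameters):

```shell
# BDP = link rate (bits/s) x RTT (s) / 8 = bytes in flight per connection.
gbit=100
rtt_us=1000
bdp_bytes=$(( gbit * 1000000000 / 8 * rtt_us / 1000000 ))
echo "$bdp_bytes"  # 12500000 bytes = 12.5 MB; the 128 MiB tcp_rmem cap is ~10x that
```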

NIC

For lowest NVMe-oF tail latency on small synchronous I/O — the shape of traffic that databases produce — disable LRO (which breaks NVMe/TCP PDU framing when it coalesces segments above the stack), tighten RX coalescing, and pin NIC IRQs to NUMA-local cores.

Runtime configuration via ethtool (replace enp1s0f1np1 with your storage interface):

# LRO coalesces segments above the stack → breaks NVMe/TCP PDU framing;
# GRO stays on.
sudo ethtool -K enp1s0f1np1 lro off

# Tight RX coalescing for lowest tail latency.
sudo ethtool -C enp1s0f1np1 adaptive-rx off rx-usecs 8 rx-frames 16

# Stop irqbalance so manual pinning of NIC IRQs to NUMA-local cores sticks.
sudo systemctl disable --now irqbalance
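The pinning step itself can be scripted; this is an illustrative sketch (the interface name is an assumption, and it steers every IRQ whose name mentions the interface to the NIC's NUMA-local CPU list rather than spreading queues individually):

```shell
# Pin all IRQs belonging to the storage NIC to its NUMA-local cores.
# Must run as root, after irqbalance is stopped.
IFACE=enp1s0f1np1
cpus=$(cat /sys/class/net/$IFACE/device/local_cpulist)
for irq in $(awk -v nic="$IFACE" '$0 ~ nic { sub(":", "", $1); print $1 }' \
        /proc/interrupts); do
    echo "$cpus" > /proc/irq/$irq/smp_affinity_list
done
```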

ethtool settings do not persist across reboots — wire them into a systemd unit or a NetworkManager connection.

Persistent configuration via NetworkManager (for systems where the storage NIC is managed by NM):

sudo nmcli connection modify enp1s0f1np1 \
    ethtool.feature-lro off \
    ethtool.coalesce-adaptive-rx 0 \
    ethtool.coalesce-rx-usecs 8 \
    ethtool.coalesce-rx-frames 16

For systems not managed by NetworkManager, wrap the ethtool commands above in a systemd unit so they re-apply on every boot.
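One possible shape for such a unit (the unit name is ours, and the ethtool binary path may be /sbin/ethtool on some distributions; adjust the interface name as above):

```shell
sudo tee /etc/systemd/system/nvme-nic-tuning.service > /dev/null <<'EOF'
[Unit]
Description=NVMe-oF NIC tuning (LRO off, tight RX coalescing)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K enp1s0f1np1 lro off
ExecStart=/usr/sbin/ethtool -C enp1s0f1np1 adaptive-rx off rx-usecs 8 rx-frames 16

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvme-nic-tuning.service
```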


What's Next