Bug 4190 - ZFS storage defaults cause poor performance
Summary: ZFS storage defaults cause poor performance
Status: UNDECIDED
Alias: None
Product: pve
Classification: Unclassified
Component: zfs
Version: 7
Hardware: PC Linux
Importance: --- bug
Assignee: Bugs
URL:
Depends on:
Blocks:
 
Reported: 2022-08-04 17:29 CEST by Jim Salter
Modified: 2024-08-09 22:58 CEST
CC List: 7 users

See Also:


Attachments

Description Jim Salter 2022-08-04 17:29:39 CEST
Proxmox VE on ZFS defaults to creating new disks as zvols with volblocksize=8K.

This configuration performs poorly for most workloads, and a lot of Proxmox users end up complaining of unexpectedly bad storage performance. Although zvols seem tailor-made for virtualization, in practice they tend to underperform badly compared to raw files on ZFS datasets, even when the dataset's recordsize is set equal to the zvol's volblocksize.

8K volblocksize/recordsize is also an extremely suboptimal configuration for most generic workloads. While likely appropriate for a dedicated PostgreSQL VM (which uses 8KiB pages by default), this blocksize results in poor compression performance and unnecessary IOPS on nearly any other workload—including but not limited to general purpose file servers, most web server applications, mail servers, virtual desktops, and more.

You can find a relatively brief Twitter thread (including charts of fio performance tests) here: https://twitter.com/jrssnet/status/1554939767661404160
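For concreteness, the two layouts being compared look roughly like this (pool, dataset, and image names are illustrative, not the exact benchmark setup):

# what PVE creates today for a 32G VM disk on a zfspool storage:
zfs create -V 32G -o volblocksize=8k rpool/data/vm-100-disk-0

# the alternative: a raw (or qcow2) file on a plain ZFS dataset:
zfs create -o recordsize=64k -o compression=lz4 rpool/data/vmfiles
qemu-img create -f raw /rpool/data/vmfiles/vm-100-disk-0.raw 32G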

Suggested improvements include:

* not defaulting to zvol storage
* not defaulting to extremely low blocksizes; compare and contrast with, e.g., qemu's default qcow2 cluster_size of 64KiB
* making usage of raw files on ZFS datasets more discoverable and manageable in the UI itself
* basic documentation—ideally including in-UI tooltips—guiding users on appropriate configuration of blocksize settings for different workloads
Comment 1 Fabian Grünbichler 2022-08-05 12:31:59 CEST
you are comparing apples to oranges here (and there are no details about what you actually benchmarked, so it's a bit hard to give a concrete reply).

volblocksize and recordsize are not the same thing (else they'd be a single property ;)) - the former is a fixed block size, the latter is an upper limit for records. what that means in practice is that increasing recordsize is almost always fine (it reduces the metadata overhead for I/O done in big chunks, while I/O with small blocks still works fine), but the same is absolutely not true for volblocksize unless your workload is tuned to match the increased block size (e.g., if your VM uses 4k blocks internally but the zvol uses a 64k blocksize, you will see immense write amplification).

8k is a reasonable compromise for volblocksize and the upstream default, but it is configurable if you need it. the qcow2 default of 64k for cluster_size is also just the upstream default - and the resulting write amplification has been criticized a lot, resulting in "sub-cluster" allocations as an opt-in feature: https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
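for reference, that knob is exposed in the storage config (storage name below is just an example) - note it only applies to newly created zvols, since volblocksize cannot be changed on an existing zvol:

pvesm set local-zfs --blocksize 16k

# or directly in /etc/pve/storage.cfg:
#   zfspool: local-zfs
#       pool rpool/data
#       content images,rootdir
#       blocksize 16k

# to illustrate the write amplification mentioned above: a guest doing 4k writes
# onto a 64k-volblocksize zvol forces a read-modify-write of the whole 64k block,
# i.e. up to roughly 16x write amplification in the worst case.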

there are some performance issues/regressions with zvols reported upstream:
- https://github.com/openzfs/zfs/issues/12483
- https://github.com/openzfs/zfs/issues/8472 

and we have considered offering "raw image on dataset" as an alternative in the past, but nobody has implemented it yet (the manual way of backing a dir storage with a ZFS dataset obviously already exists, but doesn't get the full range of ZFS benefits like snapshots and replication). we use a similar approach for our (preview) BTRFS plugin.
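the manual dir-on-dataset variant looks roughly like this (dataset and storage names are illustrative):

zfs create -o recordsize=64k rpool/vmfiles
pvesm add dir zfs-vmfiles --path /rpool/vmfiles --content images

# qcow2 images on such a storage get snapshots at the qcow2 layer, but PVE cannot
# use zfs snapshots/replication for them, as mentioned above.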

regarding docs, there is a small part available in our admin guide:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_raid_considerations

although that mainly concerns another volblocksize-related pitfall in combination with raidz ;)
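a rough back-of-the-envelope version of that pitfall (illustrative numbers: 6-disk raidz2, ashift=12, i.e. 4k sectors, default volblocksize=8k):

# 8k logical block                               -> 2 data sectors
# raidz2 parity                                  -> +2 parity sectors
# padding to a multiple of (nparity+1)=3 sectors -> +2 skip sectors
# => 24k allocated on disk for 8k of data, much worse than the ~2/3 usable
#    ratio you'd expect from a 6-disk raidz2
# visible on an existing zvol via (name illustrative):
zfs get volsize,used,volblocksize rpool/data/vm-100-disk-0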

so to summarize:
- when reporting issues based on benchmarks, please include the full benchmark details (commands run, storage setup, relevant config files)
- increasing the default volblocksize won't happen
- offering a "ZFS using datasets" option for VMs is definitely doable, but someone needs to write the patches for that ;)
Comment 2 roland.kletzing 2022-09-28 13:01:27 CEST
>- offering a "ZFS using datasets" option for VMs is definitely doable

yes please!

we have been using qcow2 on zfs datasets exclusively for quite a while. i know this does not provide optimum performance, but regarding features and ease of use we get the best of both worlds (a raw virtual disk does not provide snapshots).

we also do local snapshots of all zfs datasets with sanoid and replication with syncoid.
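for context, a minimal sketch of that setup (dataset and host names are made up, not our actual config):

# /etc/sanoid/sanoid.conf
[rpool/data]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes

# replication to a backup host with syncoid:
syncoid -r rpool/data root@backuphost:backuppool/data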

we dislike using zvols

please consider adding/managing zfs datasets via the gui
Comment 3 roland.kletzing 2022-09-28 13:03:07 CEST
oh, wtf. the ticket is by the sanoid/syncoid author himself :D :D :D

i like it very much, it runs perfectly reliably. thanks for making it!