08 August 2014

In a previous post, I have outlined the time required by a PostgreSQL database to warm-up from a HDD on a read-only load for a database that fits in memory.

In this post, I want to look at write performance on SSD, focusing on the impact of PostgreSQL page size (blocksize), and on checking whether the current 8 kB default is relevant.

Clarification note: this post focuses on a simple small tuples OLTP read-write (mostly writes) load generated with pgbench on a base that fits in memory. Conclusions really apply to this case.

## Other PostgreSQL SSD Tests

I got interested by reading a post by Hans-Jürgen Schönig who tested very large 1 MB block as well as standard page sizes. What struck me is that there is a significant impact of smaller page size on OLTP performance (6% better for 4 kB, 10% better for 2 kB, with 8 kB as the reference), suggesting that the 8 kB default is not necessarily the best choice, especially when using SSD. However the test is on a relatively small database (pgbench scale factor 10 gives about 150 MB), a short time (typically 1 minute), and with poor performance (less than 150 tps) given that a SSD was used.

I also found an older but excellent post by Tomas Vondra who played with the file system options as well as PostgreSQL page and WAL block size. The results show a +40% performance boost for 4 kB pages wrt 8 kB pages for read-write loads on a SSD. The 300 pgbench scaling is significant, reported runs seem to be 5 minutes long, and the +1000 tps performances are consistent with the SSD hardware. The downside is that both page and WAL block sizes are moved together, and only one shot performances seem to be reported, although my past experience with pgbench makes me wary of performance reproducibility and stability.

## Test Setup

I choose pgbench simple updates with prepared transactions (-N -M prepared) to focus on write performance without contention interference, scale 100 so that the database is reasonably large but can be kept in memory, 4 clients (-c 4) because my laptop has 2 HT cores. I run a number of 1000 seconds tests so as to derive hopefully meaningful statistics. With the SSD performance the 1000 seconds tests basically overwrite a majority of rows in the account table.

I used the default configuration for nearly everything (ext4 4 kB block size, mount options, PostgreSQL 9.4b2): I am interested in relative performance with respect to the varying page size, not absolute best performance, and it was simpler that way. In particular, using a journaled file system induces a significant performance penalty. The SSD hardware is a Lite-On IT LMT-256M6M mSATA 256GB on a Dell XPS 13 laptop.

## Results

Here are the performance results for 10 runs, for page sizes 2, 4 and 8 kB. It shows a 10.4% boost for 4 kB pages, and an additional 1.9% boost for 2 kB pages.

bs    avg & stdev    min    median    max
2   999.6 ± 21.7   955.7   1005.2   1025.1
4   982.5 ± 15.6   945.7    983.8   1001.2
8   890.2 ± 18.6   857.6    886.0    922.3


Detailed every-second tps measures with --progress=1 show a reasonable performance stability during a run, with periodic but not too deep ups and downs, similar to what is shown by Tomas Vondra. I also ran 200 second tests which show simular results, although with a lower minimum performance and also average, maybe an underlying variability which is smoothed by the longer run.

The first improvement is consistent for all measures (average, minimum, median, maximum) and well outside the standard deviation, thus although the number of runs is limited, I am confident that it is significant. The second improvement is also consistent, but confidence is lower. Moreover, as ext4 underlying blocks are 4 kB, the rational for using 2 kB pages is unclear.

## Analysis

Why would you want a larger page size? Filling tuples in a page is better as relatively less space is lost at end of pages, and there are less header overheads. Toasting large values will require less pages. Indexes can hold more values and be more selective with larger pages. Some maintenance algorithm complexity (eg vacuuming, checkpoint decisions…) depend on the number of pages.

Why would you want a smaller page size? When reading a page for just one tuple, you fill up the cache with the full page size, so less hot tuples may be kept in memory. When updating a tuple, a larger part of the cache is dirtied and must be written back to disk at some point.

There is a trade-off, not a surprise. However, the performance implication of random IOs is not necessarily the same depending on the underlying hardware: random read/write performance on HDDs would be about a few percent of sequential read/write, whereas on SSDs it could be up to more than 50% of the sequential write figure depending on the queue depth, see for instance Tom’s Hardware reviews.

For HDDs, random reads and writes which occur when pages are accessed and updated have a very high cost, the overhead of reading or writing being dominated by the disk rotation seek time. In that context, reading or writing 8 kB or 4 kB costs the same and costs dearly, so let’s do more: the cache will be filled up more quickly, which may help improve the hit ratio and thus performance. I think that it is the underlying logic of this choice.

For SSDs, as the cost of random reads and writes is much lower as compared to HDD, there is less insentive for large pages on that account.

## Conclusion

There is a significant positive impact on performance on SSDs with a smaller 4 kB page size for a read-write OLTP load as exemplified by pgbench (with the limitations of this particular benchmark, see the note below). However, the performance boost is not as large as the figures found by Tomas Vondra.

This suggest that the ability to tweak this parameter and possibly also the WAL block size more easily than by recompiling PostgreSQL could be a good move.

The next question is what happen when playing with PostgreSQL page size on HDDs. This may be the subject of another post.

Note: Josh Berkus pointed out that this does not prove that smaller page sizes are always beneficial, esp. as pgbench uses rather small 100-bytes tuples, and that the base itself is small. I can only agree! Other benchmarks could lead to other conclusions about the best page size. My point is more that it should be easier to tweak the page size as there is a consistent and significant impact on performance. I ran more tests with larger rows, although still with a small base, which show some benefit with smaller pages.

2014-08-17: Follow-up

2016-04-10: Russian translation by Компьютерная помощь.