17 August 2014

I recently posted performance figures for pgbench runs on SSD with varying page sizes, which showed a +10% improvement with a smaller 4 kB page size over the default 8 kB page size.

Josh Berkus pointed out that the pgbench test uses rather small 100-byte rows, and that changing the tuple size might lead to a different conclusion. To assess this point, I ran tests with different row sizes: 1 kB, 3 kB, 7 kB and 15 kB.

Again, the test focuses on a base of a few GB which fits in memory, so what is exercised is mostly random writes (i.e. no random reads, and no sequential reads or writes).

Test Setup

I ran prepared transactions (-M prepared) to save CPU cycles and focus on write performance. The scale is 10 for 1 kB rows (about a 1 GB base), 10 for 3 kB rows, 5 for 7 kB rows and 2 for 15 kB rows (about 4 GB bases): bases are small enough to fit in memory, so that reads should not require disk accesses. I ran 200-second tests with 4 clients (-c 4) on PostgreSQL 9.4b2. Although this is short, I am confident that it is enough: the rather good performance of SSDs implies dozens of checkpoints during that time, and prior tests comparing 1000-second and 200-second runs on SSDs showed that the latter were consistent with the former, although slightly on the optimistic side.

I lowered autovacuum_naptime to 20 seconds and checkpoint_timeout to 1min, and raised checkpoint_completion_target to 0.8, to shake things up a bit, but I do not think these parameters have much influence given the overall performance. I used the default configuration for everything else. The hardware is the same as in the previous test.
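For reference, a 1 kB-row run would look something like the following; the invocations are reconstructed from the options listed above, so this is a sketch rather than a verbatim transcript:

# non-default settings in postgresql.conf
#   autovacuum_naptime = 20s
#   checkpoint_timeout = 1min
#   checkpoint_completion_target = 0.8

# initialize a scale-10 base, then widen rows to 1 kB as shown below
pgbench -i -s 10 bench

# 200-second simple-update run, 4 clients, prepared statements
pgbench -N -M prepared -c 4 -T 200 bench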

The row size is changed by redefining the filler attribute of the tables with commands like:

ALTER TABLE pgbench_accounts
  DROP COLUMN filler,
  ADD COLUMN filler CHAR(1000) NOT NULL DEFAULT '',
  -- store the filler without compression
  ALTER COLUMN filler SET STORAGE EXTERNAL;
VACUUM FULL pgbench_accounts;

The key point is to change the STORAGE setting: with the default EXTENDED value, TOAST compression kicks in above about 2 kB, and as the default empty string is blank-padded by CHAR into a highly compressible value, rows would end up even smaller than in the standard test…
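One way to check that the filler values indeed stay uncompressed is to look at their stored size, which should remain close to the declared length rather than shrinking to a few compressed bytes:

-- with CHAR(1000) and EXTERNAL storage, expect about 1000 bytes per value
SELECT avg(pg_column_size(filler)) FROM pgbench_accounts;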

Results

Here are the performance results for the two series of tests. In the tables below, bs is the page size in kB and the other figures are tps.

The first series uses simple updates (-N) to focus on write performance without lock contention or similar issues. Note that the different sets were not run under identical conditions, so they should be compared with care: I am focusing on the relative performance of differing page sizes (blocksize in PostgreSQL parlance).
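As a reminder, the blocksize is set at compile time, so each page size requires a distinct build, along the lines of:

./configure --with-blocksize=4    # page size in kB, default is 8
make && make install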

First, a reminder with 100-byte rows on ten 1000-second runs:

bs    avg ± stdev    min   median    max     perf
 2   999.6 ± 21.7   955.7  1005.2   1025.1  +12.3%
 4   982.5 ± 15.6   945.7   983.8   1001.2  +10.4%
 8   890.2 ± 18.6   857.6   886.0    922.3    =

With 1 kB rows on ten 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2  1004.5 ± 14.7   982.3  1010.1  1021.7   +32.5%
 4   953.0 ± 25.2   909.2   963.8   979.5   +25.7%
 8   757.8 ± 27.2   721.6   749.3   804.6     =

With 3 kB rows on ten 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2   878.8 ± 29.0   839.6   884.4   927.4   +10.4%
 4   858.0 ± 24.6   818.1   858.2   894.4   + 7.8%
 8   795.7 ± 32.5   760.3   791.7   873.9     =

With 7 kB rows on fifteen 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2   718.4 ± 37.9   605.9   723.8   774.2   +11.8%
 4   669.2 ± 48.8   581.9   668.6   722.2   + 4.1% (significant?)
 8   642.6 ± 51.7   573.8   642.4   742.3     =

With 15 kB rows on ten 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2   515.8 ± 57.6   426.4   502.0   616.2   + 1.2% (significant?)
 4   491.0 ± 28.8   447.9   485.4   557.7   - 3.7% (significant?)
 8   509.8 ± 28.2   468.0   518.5   542.0     =

The detailed per-second tps figures become more unstable as rows get larger, especially with 7 kB and 15 kB rows. Differences are mostly consistent and significant, except in a few cases such as the 4 kB vs 8 kB comparison for 7 kB rows, where the standard deviations overlap widely.

The second test series generates heavier updates by running a custom script, similar to the simple-update one, which also updates the filler attribute so that the TOAST-ed value is changed as well.
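Assuming pgbench's built-in simple-update script as a starting point (9.4 syntax), with a random value to force the filler to actually change, such a script would look something like this for the 1 kB case:

\set nbranches 1 * :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
-- also rewrite the filler, so that the TOAST-ed value changes
UPDATE pgbench_accounts SET abalance = abalance + :delta, filler = rpad(random()::text, 1000, 'x') WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

It would be run with something like pgbench -M prepared -c 4 -T 200 -f script.sql.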

With 1 kB rows on twenty 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2   738.1 ± 69.2   677.2   705.3   925.7   - 6.5%
 4   788.5 ± 83.9   671.7   769.7   924.3   - 0.1% (not significant?)
 8   789.6 ± 99.6   681.0   779.9   939.3     =

With 3 kB rows on twelve 200-second runs:

bs    avg ± stdev    min   median    max     perf
 2  444.1 ± 170.5   299.4   338.5   702.0   +35.8%
 4  429.6 ± 106.6   316.5   374.4   623.7   +31.3%
 8  327.1 ±  18.5   310.0   318.2   365.6     =

For the record, with 7 kB rows on fifty 200-second runs:

bs    avg ± stdev    min   median    max     perf
 4  292.6 ± 156.7   125.2   191.5   634.6   +15.1% (inconsistent)
 8  254.1 ±  99.9   115.6   229.1   437.1     =

Note that these heavy updates are much less stable than the previous series, with a pretty large standard deviation coming from the wide performance range at small page sizes, and some inconsistent results.

Conclusion

In the above test series, the default 8 kB page size beats the 4 kB page size only once, for 15 kB tuples, and by a few percent. The one other case where it seems slightly better is not significant. Although there may exist settings for which an 8 kB page size is better than a smaller page size on a SSD with this kind of OLTP load, I have not really found them. This conclusion is consistent with prior reports by Tomas Vondra and Hans-Jürgen Schönig.

From this limited evidence, smaller pages seem the better choice on SSDs: as random writes are much faster on SSDs than on HDDs, there is much less incentive to pack things into large pages on that account. Also, smaller pages will reduce SSD wear-out.

2014-08-19: Note, as pointed out yet again by Josh Berkus: these tests are quite specific. Again, I can only agree, but this is what I can do with the available SSD hardware (i.e. my laptop :-). These tests focus on transactions which generate writes to the WAL and random writes when modified pages are flushed at checkpoints. First, as the up-to-4 GB bases fit in memory, random reads are not really exercised. Other read-only SSD tests by Tomas Vondra point to performance gains of about +60% for 4 kB pages over 8 kB pages; although each measure is a single shot and the WAL block size is modified as well, his many tests with different settings are all consistent for the setup he is using. Yet again, this is one particular kind of test. Second, there are no sequential reads or writes: pgbench really generates a transactional load of simple SELECTs, UPDATEs and INSERTs. Big analytic queries would probably favor larger pages, but this remains to be checked.
