11 December 2016

PostgreSQL 9.6 bloom is an extension contributed by Teodor Sigaev, Alexander Korotkov and Oleg Bartunov which provides a new index type for integer and text columns. There is some coverage of how to use it and how it works, which is welcome because the documentation is scarce.

This blog entry briefly describes how the index works and discusses how to choose the parameters associated with this index, namely the signature size (default 80 bits, allocated in chunks of 16 bits) and the number of bits for each indexed column (default 2), with a rule-of-thumb approach based on pretty rough approximations.

The short version is that there is a simple formula to suggest the number of bits for a column based on its entropy, and another formula to compute the best signature size, the result of which varies widely between SSD and HDD. You will have to read further to get them :-)

PostgreSQL Bloom Index

A Bloom filter is a compact bit array which represents a set of values. The structure allows testing whether an element probably belongs to the set, with some false positives but no false negatives: the test may return true although the element is not in the set, but cannot return false if it is.

A Bloom index created for a set of columns can be used to speed up queries for a conjunction (AND) of equalities (=) on a subset of these columns:

CREATE EXTENSION bloom;
CREATE INDEX SomeIndex ON SomeTable USING bloom (c1, c2, c3, c4);
-- a query which may take advantage of the Bloom index
SELECT * FROM SomeTable WHERE c2 = 5432 AND c4 = 1234;

The Bloom index is built by associating a small Bloom filter called a signature with each row of a table. For each indexed column value, a few bits of the bit array are turned on, chosen by hashing the column number and the column value. These signatures are then aggregated and managed as an on-disk index.

When a search is performed with the index, the partial signature corresponding to the columns provided in the query is built, and then all rows whose signature contains these bits are selected and rechecked against the actual data. As the whole or a significant part of the index is scanned, it only makes sense if the index size is significantly smaller than the table size, which implies that the signature size must be significantly smaller than the row size.

Signature Size

In order to choose the index parameters, we build a simple model to understand their effects. Let \(s_s\) be the signature size, \(i\) denote the index columns, \(b_i\) the number of bits turned on for each column. Assuming that values in a column are evenly distributed and that the hash function is perfect, each bit of the signature may be 1 with probability: \[p(i, s_s) \, = \, 1 - \left( 1 - \frac{1}{s_s} \right)^{\sum_i b_i} \, \approx \, 1 - e^{-\frac{\sum_i b_i}{s_s}} \, \approx \, \frac{\sum_i b_i}{s_s}\]

The latter approximation assumes that \(\sum_i b_i \ll s_s\). Then when a query on columns \(q\) is performed, the selectivity \(\sigma\) of the index is the product of the probabilities of the signature bits that occur in the partial signature: \[\sigma(i, q, s_s) = p(i, s_s)^{p(q, s_s) \cdot s_s} \approx \left(\frac{\sum_i b_i}{s_s}\right)^{\sum_q b_q}\]
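As an illustrative sanity check with made-up but typical numbers: with 5 indexed columns at 2 bits each and the default 80-bit signature, a query providing 2 of these columns gives roughly \[p \approx \frac{5 \cdot 2}{80} = 0.125 \qquad \sigma \approx 0.125^{2 \cdot 2} = \frac{1}{4096} \approx 2.4 \cdot 10^{-4}\] that is, about 240 false-positive candidate rows per million rows, on top of the actual matches.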

There is a trade-off between a large and precise Bloom index, which takes longer to scan but selects fewer rows to be examined later, and a smaller one which is scanned quickly but selects more rows to check further. With \(k\) the cost of reading a page sequentially, \(K\) the cost of reading a random page, \(p_s\) the page size in bits, \(k_s\) the size of a row pointer (e.g. page number and offset within the page), and \(n\) the number of rows, the cost of a search (neglecting log factors, and assuming one useful row per loaded page) is: \[k \, n \left( \frac{s_s + k_s}{p_s} \right) + K n \cdot \sigma(i, q, s_s)\]

The left term is the sequential scan of the index, the right term is the random accesses on table pages. By substituting, simplifying and zeroing the derivative of the formula, the cost is minimal for signature size: \[s_s = \left( p_s \frac{K}{k} \left( \sum_q b_q \right) \left( \sum_i b_i \right)^{\sum_q b_q} \right)^{\frac{1}{1+\sum_q b_q}} \approx 1.3 \, \left( \sum_i b_i \right)^{\frac{\sum_q b_q}{1+\sum_q b_q}} \left( p_s \, \frac{K}{k} \right)^{\frac{1}{1+\sum_q b_q}} = 1.3 \cdot I^{\frac{Q}{1+Q}} \cdot \left( p_s \frac{K}{k} \right)^{\frac{1}{1+Q}}\]
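For the record, the intermediate step is simply to substitute the approximated \(\sigma\) into the cost above and to zero its derivative with respect to \(s_s\) (the constant \(k_s\) term vanishes): \[\frac{k \, n}{p_s} - K \, n \left( \sum_q b_q \right) \frac{\left( \sum_i b_i \right)^{\sum_q b_q}}{s_s^{\,1+\sum_q b_q}} = 0 \quad \Rightarrow \quad s_s^{\,1+\sum_q b_q} = p_s \frac{K}{k} \left( \sum_q b_q \right) \left( \sum_i b_i \right)^{\sum_q b_q}\]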

with \(I = \sum_i b_i\) and \(Q = \sum_q b_q\) the number of switched bits in the index and query columns. Constant 1.3 is an average value for \(Q^{\frac{1}{1+Q}}\) with \(Q\) small. The ideal signature size \(s_s\) is roughly proportional to the number of bits allocated to index columns \(I\), so as to maintain the selectivity. The more query column bits \(Q\) the smaller the signature, as more bits provide more filtering clues.

The \(\frac{K}{k}\) ratio compares random and sequential page read performance, which varies widely depending on the technology, from maybe 200 on HDD down to 4 or less on SSD, or even 1 for data in cache: signature sizes can be much smaller on SSD, as filtering there does not need to be as effective as on HDD because the penalty of loading useless random pages is much reduced. The dependency on page size \(p_s\) may look strange, but it is really just a scaling parameter: if it is changed, then the sequential access cost \(k\) changes proportionally, so the \(\left( p_s \frac{K}{k} \right)\) factor in the formula stays pretty much constant, and so does the best signature size.

Note that this best signature size does not account for the cost of storing and maintaining the index: a smaller size, although less effective, may be a better compromise. Taking this into account, we would really like to find the smallest signature size for which the cost stays close to the minimum. Assuming a reduction factor \(r < 1\) on the signature size, the cost of scanning the index is reduced by that factor, but the cost of accessing the table is increased by \(r^{-Q}\). To arbitrarily limit this increase to 30%, we can choose \(r\) down to: \[r = 1.3^{-\frac{1}{Q}}\]

This leads to the reduced signature size \(s_r\) defined as: \[s_r = r \cdot s_s = 1.3^{-\frac{1}{Q}} \cdot s_s = 1.3^{\frac{Q-1}{Q}} \cdot I^\frac{Q}{1+Q} \cdot \left( p_s \frac{K}{k} \right)^{\frac{1}{1+Q}}\]

In practice we find \(r \approx 0.95\) for typical query settings (a few columns, a few bits per column).
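For example, with a few query bits: \[Q = 4: \; r = 1.3^{-\frac{1}{4}} \approx 0.94 \qquad Q = 6: \; r = 1.3^{-\frac{1}{6}} \approx 0.96\]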

Number of Column Bits

The formula above gives an indication for the signature size, which depends on the number of bits of both the index and the query columns. The next step is to choose how to allocate bits between index columns. The minimal search cost, through \(s_s\), is such that the lower \(I\) and the larger \(Q\) the better, which leads to a trade-off as \(Q \leq I\).

From an information theory perspective, the bits chosen in the signature store part of the entropy of the column values, which provides the selectivity of the index. Each bit is chosen within the \(s_r\) bits of the signature, thus may store about \(\log_2 s_r\) bits of entropy (a little less really, because of collisions with other bits, say \(\log_2 s_r - I\), although \(I\) is not known yet). For a given index column, there is no point in trying to store more entropy than the column intrinsically holds. For columns with an entropy smaller than \(\log_2 s_r\), one bit transfers most of it into the index. Otherwise the more bits the more entropy is transferred, but with a decreasing efficiency. The following choice transfers about two-thirds of the column entropy to the index: \[b_i = \frac{\mbox{entropy}(\mathtt{ci})}{\log_2 s_r}\]

(More precisely \(b_i \cdot \log_2 s_r = - \ln (1 - f) \cdot \mbox{entropy}(\mathtt{ci})\) for transferring fraction \(f\) of entropy, under the assumption that \(\mbox{entropy}(\mathtt{ci}) \gg \log_2 s_r\). See the end of the page for a better approximation). Allocating bits in the signature for columns with small entropy, that is with few distinct or poorly distributed values, is probably not worth the space. The attentive reader will also have noticed the chicken-and-egg issue raised by this formula: the number of bits per column depends on the signature size, but the signature size depends on the number of bits per column. Iterating once or twice may be necessary to converge.
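To illustrate the iteration with numbers reused from the examples below: starting from the default \(s_r = 80\), a column with 19.1 bits of entropy gets \(19.1 / \log_2 80 \approx 3\) bits; with five such columns and queries on two of them, \(I = 15\) and \(Q = 6\), and the SSD formula gives \(s_r \approx 64\); recomputing \(19.1 / \log_2 64 \approx 3.2\) still rounds to 3 bits, so the choice has converged.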

Shannon entropy can be computed directly on a column by applying the famous \(- \sum_i p_i \log_2 p_i\) formula:

-- column entropy, in shannons (aka bits)
SELECT SUM(- pi * ln(pi) / ln(2)) AS "entropy"
 FROM ( -- frequency of column values
   SELECT 1.0 * COUNT(*) / (SELECT COUNT(*) FROM SomeTable)
   FROM SomeTable GROUP BY someColumn
 ) AS p(pi);
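
The entropy query can be combined with the bit formula above to suggest a per-column bit count directly. A minimal sketch, which assumes a candidate 80-bit signature (adjust this value and iterate as discussed above):

-- suggested number of bits for someColumn, for a candidate 80-bit signature
SELECT entropy, round(entropy * ln(2) / ln(80)) AS "suggested bits"
 FROM ( -- column entropy, as computed above
   SELECT SUM(- pi * ln(pi) / ln(2)) AS entropy
   FROM (SELECT 1.0 * COUNT(*) / (SELECT COUNT(*) FROM SomeTable)
         FROM SomeTable GROUP BY someColumn) AS p(pi)
 ) AS e;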

A UNIQUE column provides \(\log_2 n\) entropy, but as it is already indexed it does not need to appear again in a Bloom index.

A column holding uniformly distributed dates over 6 years, that is about 2200 distinct values, provides 11.1 bits of entropy. For the default 80-bit signature size, which holds about \(\log_2 80 = 6.3\) bits of entropy per chosen bit, this suggests that \(11.1/6.3 \approx 2\) bits could be allocated to such a column. Good news, this is the default.

A 1,000,000 element column with uniformly distributed values drawn from 1,000,000 values, as used in the PostgreSQL bloom extension documentation example, provides about 19.1 bits of entropy, slightly less than the 20 bits of a unique column of the same size. Thus, for the default 80-bit signature again, \(19.1/6.3 \approx 3\) bits should be allocated to such a column.

The relevant number of bits for a column is typically between 1 and 3. The resulting search cost, obtained by substituting \(s_s\) into the initial cost function and adjusting constant factors, is proportional to: \[( 1.3 + 1.3^{-Q} ) \cdot I^\frac{Q}{1+Q} \cdot \left( p_s \frac{K}{k} \right)^\frac{1}{1+Q}\]

Now we are a little bit stuck, as the cost depends on the particulars of the indexed columns and of the queries. My advice is to explore starting from the entropy-derived value, and to check how changing the number of bits, especially for columns less frequently occurring in queries, impacts the index size and search cost.

Choosing Parameters

Let us consider the 9 random integer columns in the documentation example, with a query on 2 columns. By using the default of 2 bits per column we have \(I = 18\) and \(Q = 4\). On SSD, this leads to: \[s_r = 1.3^\frac{3}{4} \cdot 18^{\frac{4}{5}} \cdot (8000 \cdot 4)^{\frac{1}{5}} \approx 98 \rightarrow 96 \ldots 112\]

But on HDD we get a less interesting result, as the best size computed is nearly as large as a table row: \[s_r = 1.3^\frac{3}{4} \cdot 18^{\frac{4}{5}} \cdot (8000 \cdot 200)^{\frac{1}{5}} \approx 214 \rightarrow 208 \ldots 224\]

On HDD, if the default signature size is used instead, the search with the index is 10 times more expensive, although that might still be better than a full table scan.

Let us consider a more favorable example: if we have 5 random columns, 2 of which are used in queries, then with 3 bits per column on SSD: \[s_r= 1.3^\frac{5}{6} \cdot 15^{\frac{6}{7}} \cdot (8000 \cdot 4)^{\frac{1}{7}} \approx 56 \rightarrow 64\]

And on HDD: \[s_r = 1.3^\frac{5}{6} \cdot 15^{\frac{6}{7}} \cdot (8000 \cdot 200)^{\frac{1}{7}} \approx 98 \rightarrow 96 \ldots 112\]

For this setting, using only 2 bits per column increases the signature size (61 on SSD and 134 on HDD vs 56 and 98 above) and the overall cost. However, if the query used 3 columns, 2 bits per column would probably be better. The best choice is quite sensitive to the expected number of columns used in queries.

Experiments

Let us create a 1,000,000 row 57 MB table with 5 random integer columns:

CREATE TABLE foo(id SERIAL, c1 INT, c2 INT, c3 INT, c4 INT, c5 INT,
                 it TIMESTAMP NOT NULL DEFAULT NOW());
INSERT INTO foo(c1, c2, c3, c4, c5)
  SELECT random() * 1000000, random() * 1000000, random() * 1000000,
         random() * 1000000, random() * 1000000
  FROM generate_series(1, 1000000);

The following query takes 110.2 ms:

SELECT * FROM foo WHERE c1 = 171267 AND c4=65555;
 -- Seq Scan on foo
 --   Filter: ((c1 = 171267) AND (c4 = 65555))
 --   Rows Removed by Filter: 999999

After creating a 13 MB Bloom index on the five columns:

CREATE INDEX foo_c1c2c3c4c5
  ON foo USING bloom(c1, c2, c3, c4, c5)
  WITH (length=64, col1=3, col2=3, col3=3, col4=3, col5=3);
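
The sizes mentioned above can be checked with the usual admin functions (a minimal sketch, with the table and index names of this example):

-- compare table and index sizes
SELECT pg_size_pretty(pg_table_size('foo')) AS "table size",
       pg_size_pretty(pg_relation_size('foo_c1c2c3c4c5')) AS "index size";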

The same query takes 9.3 ms:

SELECT * FROM foo WHERE c1 = 171267 AND c4=65555;
 -- Bitmap Heap Scan on foo
 --   Recheck Cond: ((c1 = 171267) AND (c4 = 65555))
 --   Rows Removed by Index Recheck: 232
 --   Heap Blocks: exact=232
 --   ->  Bitmap Index Scan on foo_c1c2c3c4c5
 --       Index Cond: ((c1 = 171267) AND (c4 = 65555))

The index provides a 10 to 20 times speedup for such queries on my SSD laptop.

Conclusion

The Bloom index benefit is sensitive to the number of query bits that take advantage of the filter selectivity. If the index selectivity is too low, for instance if only one column is used in the query, or if too few bits are allocated to index columns, the database will probably revert to scanning the whole table anyway.

The Bloom index can be quite effective if you have enough columns to index and you query on significant subsets of them. The main benefit is that one index is shared for searches on different columns, but it only works for a conjunction of equalities.

Finally, the index advantage is enhanced on SSD, where the random access penalty is reduced, which allows for smaller indexes.


Updates since initial publication: add some links; fix some formulas; better discuss the choice of the number of bits per column.

A more precise over-approximation of the entropy transferred per signature bit is the base-2 logarithm of the expected number of signature bits turned on when hashing all possible column values: \[\log_2 \left( s_r \cdot \left( 1 - e^{-\frac{2^{\mbox{entropy}(\mathtt{ci})}}{s_r}} \right) \right)\]

which is close to \(\log_2 s_r\) for a large enough column entropy.
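
For instance, with \(s_r = 80\): a column with 19 bits of entropy still yields about \(\log_2 80 \approx 6.3\) entropy bits per signature bit, whereas a low-entropy column with only 5 bits of entropy yields \[\log_2 \left( 80 \cdot \left( 1 - e^{-\frac{2^5}{80}} \right) \right) \approx \log_2 26.4 \approx 4.7\] which is why bits are better spent on high-entropy columns.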