By Kaarel Moppel – A new feature called Bloom indexing (an implementation of Bloom filters) slipped in somewhat quietly with the latest 9.6 release. New index methods are pretty rare. At the moment of the release, I of course looked through the 9.6 release notes, spent a minute there, and thought – a new index type, great!

But to be honest, the explanations were not too detailed. I didn’t exactly understand all the bits involved, leaving me with an impression that it can only tell Postgres which search value(s) “do not exist” in a table, making its use limited.  But now I took some time look deeper into this new functionality and I was positively surprised – it is actually a fully working and potentially very useful index method for real life problems, that one should know about.

Working principles of the Bloom index – and when to apply it?

First, we should explain what’s the intended use case for Bloom filters/indexes – when should we use it? Short answer – when we have a lot of columns and we’re also using many of those columns in kind of random combinations in queries involving the equality operator (“=”), e.g. something like “SELECT * FROM tbl WHERE col1 = 2 AND col9 = ‘x’ AND col17 = 865”. And note the accent on the “equality operator” part – as for Bloom, on the mathematical side, hashes of values are stored/compared, and range operations don’t make any sense there.

Basically Postgres calculates a hash for each of your column values and then stores some bits out of each hash as one index entry, together with the row’s physical location info (as with every other index). And exactly this “merging” of many column values into one index entry, resulting in a signature in our Bloom context, is where the effectiveness of this index type comes to shine.

In short – it can help you to save a lot of disk space! Thus instead of 10 separate normal B-tree indexes you can now have only one Bloom index that’s though lossy, meaning it won’t give you perfect accuracy as matched values need to be always re-checked from the table, but from a probabilistic viewpoint it is “good enough” to be useful. If of course applied and configured correctly – and hopefully after reading through this Blog it shouldn’t be a problem:) But okay, given we have thought about our application now and have a hunch about a legit use case for Bloom – how does it work?

Basic usage for Bloom

So here’s how it goes in the simplest form:


CREATE EXTENSION bloom;  -- make sure “contrib” is installed



CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)

WITH (length=80, col1=2, col2=2, col3=4);

Looks somewhat similar to GIN/GiST indexes so far. Create the extension, define your index as making use of the Bloom index method, and basically start querying. (Hope that Postgres will use the index!) And remember, currently Postgres only supports equality comparisons, with the additional limitation that it also supports only integer and text data types. This shouldn’t be too much of a show-stopper for human input queries. We’re not too good at remembering/inputting floating point numbers and millisecond precision timestamps.

But what on earth do those extra parameters in the WITH part mean? The documentation says:

The index is created with a signature length of 80 bits, with attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could have omitted the length, col1, and col2 specifications since those have the default values.

Things get fuzzy here … how does it add up exactly? 80 bits total is mapped to only 2×2 bits + 4 bits. Eh? Couldn’t find any good examples or explanations by googling also, so – taram taram taraa … decided to go for the absolute truth – meaning hitting the source code for the extension.

Going deeper

So after an hour of snooping around in the code (and grinding my teeth, simultaneously feeling a lot of respect for the authors as this stuff is pretty clever in the end) I think I got it figured out and I also gave a pet name to the algorithm in the process, which could well be named “wicked semi-random bit twisting modulo game” 😉 For those who want to check it out themselves, the “light bulb on” parts of code can be found from the signValue() function in blutils.c.

Here’s a small excerpt showing which signature bits are set for a single column value:


/*

* Init hash sequence to map our value into bits. the same values in

* different columns will be mapped into different bits because of step

* above

*/

hashVal = DatumGetInt32(FunctionCall1(&state->hashFn[attno], value));

mySrand(hashVal ^ myRand());



for (j = 0; j < state->opts.bitSize[attno]; j++)

{

/* prevent multiple evaluation in SETBIT macro */

nBit = myRand() % (state->opts.bloomLength * SIGNWORDBITS);

SETBIT(sign, nBit);

}

So what does it actually tell us then? We see that there’s quite a bit of randomness involved. Luckily, it answers my question about 2×2 bits + 4 bits adding up to 80 bits. How it works is that for every single indexed column and column value, Postgres calculates separate bit storing locations! (Two locations by default, but user-configurable in the WITH part.) This basically guarantees that the whole available bit range (80 bits by default) fills up in an uniform way, making good use of all of our bits. The column index serves as a randomness “seed”. Normally, from a security point of view, not something we would like to do, but here it makes perfect sense. With a normal/random seed, indexing and index comparisons wouldn’t be predictable. Quite an elegant solution in the end!

More on Bloom

Hope you followed me so far … it’s probably enough material to digest for one blog post. Since we’re now familiar with the global concept and the inner mechanics, I plan to follow up with a sample test case: Read on for info on performance and other factors in the next post. We’re ready when you are ready!