While the term faceting may sound foreign, you have almost certainly run into it in your adventures online. Facets are those helpful little boxes that turn up when you browse a web shop or search a catalog, telling you how to narrow down the search results further (for example, by color) and how many items will be left. They are immensely helpful for users – so naturally you’ll want to implement them in your application. This article will look into how to do that.
Conceptually, faceting has two steps. First, take a universe of things and narrow it down to the current search result; let’s call this the base query. Second, extract from the matched set of items the attributes and values that could be used for narrowing down further – these are the facets.
Let’s start with a schema of a document database.
```sql
CREATE TABLE documents (
    id          int4 primary key,
    created     timestamptz not null,
    finished    timestamptz,
    category_id int4 not null,
    tags        text[],
    type        mimetype,
    size        int8,
    title       text
);
```
And let’s say you want to see distributions for:

- Document type
- Category
- Tags
- Size
- Start and end timestamps
Here, there are a few different kinds of facets. The simplest example is type: you can just use each distinct type as a facet and count up the occurrences in your result set, for example type=application/pdf (1234). Another simple one is category_id; there, too, you can directly count up the number of matching rows for each value the column can take. The difference from type is that you will want to look up the name of the category for display purposes. Categories could also be hierarchical, but for now let’s just brush this aside as a “visualization concern”. We’ll talk about other things you could do later.
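That lookup can be a simple join. Here is a minimal sketch, assuming a hypothetical categories(id, name) table (the `WHERE ...` stands in for your base query conditions):

```sql
-- Count per category, then join to a hypothetical categories table for display names
SELECT c.name, f.cnt
FROM (
    SELECT category_id, COUNT(*) AS cnt
    FROM documents
    WHERE ...              -- base query conditions
    GROUP BY 1
) f
JOIN categories c ON c.id = f.category_id
ORDER BY f.cnt DESC;
```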
What counts as a facet? Categorical and continuous variables
What type and category_id have in common is that they are what are called categorical variables – they can take on a limited, and relatively small, set of values, so you don’t have to do anything special to count them up as facets. The other attributes you are considering are continuous variables: almost every row will have a unique value, but some values are close to each other and some are far apart. Creating one unique facet for each row is not particularly useful to users.
Bucketing for continuous variables
The typical way to handle continuous variables is to use bucketing. You divide the whole value space up into larger chunks and consider those chunks as a categorical variable. For example, for timestamps you could consider all entries from a single month or year to be the same, or for size you could just arbitrarily pick some ranges for small, medium and large documents.
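PostgreSQL’s width_bucket does exactly this when given an array of lower bounds: each value gets the number of the bucket it falls into, with 0 for values below the first bound. A quick sketch:

```sql
-- Each array element is a lower bound: 500 falls into bucket 1 ([0,1000)),
-- 2500 into bucket 2 ([1000,5000)), 750000 into bucket 7 (>= 500000)
SELECT size,
       width_bucket(size, array[0,1000,5000,10000,50000,100000,500000]) AS bucket
FROM (VALUES (500), (2500), (750000)) AS t(size);
```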
tags is what could be called a composite variable, because each result can be split up into the multiple facet values it carries. Conceptually these are not too different from categorical variables, with the major difference that the counts of the different categories do not add up to exactly the number of result rows. The inverse is also an important consideration: selecting one tag does not exclude selecting others.
A simple way to calculate facets is to just run your search query and then group by the facet value and count the results:
```sql
SELECT type, COUNT(*) FROM documents WHERE ... GROUP BY 1;
SELECT category_id, COUNT(*) FROM documents WHERE ... GROUP BY 1;
SELECT date_trunc('month', created), COUNT(*) FROM documents WHERE ... GROUP BY 1;
SELECT date_trunc('month', finished), COUNT(*) FROM documents WHERE ... GROUP BY 1;
SELECT width_bucket(size, array[0,1000,5000,10000,50000,100000,500000]), COUNT(*)
    FROM documents WHERE ... GROUP BY 1;
```
For tags you need a slightly more complicated query to unwrap the values, but the general shape is similar. The tags could also be stored in an association table for a more traditional data model, but that wouldn’t change much here.
```sql
SELECT tag, COUNT(*)
FROM documents, LATERAL unnest(tags) tag
WHERE ...
GROUP BY 1;
```
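For reference, with a hypothetical document_tags(document_id, tag) association table the same counts would look like this:

```sql
-- Same tag facet counts, with tags normalized into an association table
SELECT t.tag, COUNT(*)
FROM documents d
JOIN document_tags t ON t.document_id = d.id
WHERE ...
GROUP BY 1;
```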
This approach has a few downsides. The first is that you have to re-execute the base query for each attribute you want to facet on. There is also a lot of repetition, which makes it hard to handle the results generically on the client side. Unfortunately, there is no way to get multiple result sets out of a single query in PostgreSQL.
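As an aside, GROUPING SETS can at least share a single scan between several facets, though the output is still awkward to consume generically – the non-grouped columns come back as NULLs:

```sql
-- One pass over the data, one grouping per facet; NULL columns indicate
-- which grouping set each output row belongs to
SELECT type, category_id, date_trunc('month', created) AS created_month, COUNT(*)
FROM documents
WHERE ...
GROUP BY GROUPING SETS ((type), (category_id), (date_trunc('month', created)));
```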
If you want to get all the facets in a single result set, you have to make some concessions. The SQL type system does not support variable data types, so you have to cast everything to text. But if you accept this, you can transform the query to a form where a lateral subquery returns all facets present in each matched row:
```sql
SELECT facet_name, facet_value, COUNT(*)
FROM documents,
    LATERAL (VALUES
        ('type', type::text),
        ('category_id', category_id::text),
        ('created', date_trunc('month', created)::text),
        ('finished', date_trunc('month', finished)::text),
        ('size', width_bucket(size, array[0,1000,5000,10000,50000,100000,500000])::text)
        UNION ALL
        SELECT 'tags', unnest(tags)
    ) facets(facet_name, facet_value)
WHERE ...
GROUP BY 1, 2;
```
Here you are running your base query the same as always, but for each matched row you create multiple intermediary rows of the form (facet_name, facet_value). Then you just group by that pair and count up how many of each pair you found.
Faceting small vs big result set sizes
This approach works reasonably well at smaller result set sizes. You could optimize a bit by not wasting time counting facets that are already filtered down to one value, if you are willing to dynamically generate the query. However, when you start to get up in the tens to hundreds of thousands of rows, the response times start to become noticeable. At around half a million matching results, the execution times go above 1 second even if lots of parallel workers are allowed. If you have hundreds of millions of documents and are matching a large fraction of them, the execution time can easily go into minutes.
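For example, if the user has already filtered on category_id, that facet’s count is known to equal the result size, so a dynamically generated query could simply leave it out of the VALUES list:

```sql
-- category_id is pinned by the filter, so it is dropped from the facet list
SELECT facet_name, facet_value, COUNT(*)
FROM documents,
    LATERAL (VALUES
        ('type', type::text),
        ('created', date_trunc('month', created)::text),
        ('finished', date_trunc('month', finished)::text),
        ('size', width_bucket(size, array[0,1000,5000,10000,50000,100000,500000])::text)
    ) facets(facet_name, facet_value)
WHERE category_id = 24     -- the already-selected facet value
GROUP BY 1, 2;
```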
So what happens if you have more things you want to count up? Are the big companies using black magic not available for mere mortals? Of course not. The trick to calculating facet counts quickly is to arrange data in a way that makes it really fast to intersect lists of rows matching some conditions and calculate the size of the result set. This is done using inverted indexes that store a list of matching documents for each facet value.
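The idea can be sketched directly in SQL. Assuming the pg_roaringbitmap extension is available, an inverted index is just a bitmap of matching document ids per facet value, and counting within a filter becomes a bitmap intersection plus a cardinality. The table and column names here are hypothetical:

```sql
-- One posting list (bitmap of document ids) per (facet_name, facet_value)
CREATE TABLE facet_index (
    facet_name  text,
    facet_value text,
    postings    roaringbitmap,
    PRIMARY KEY (facet_name, facet_value)
);

-- Counting 'type' facets within a category filter: intersect each type's
-- bitmap with the filter's bitmap and take the cardinality
SELECT i.facet_value,
       rb_and_cardinality(i.postings, f.postings) AS cnt
FROM facet_index i
JOIN facet_index f
  ON f.facet_name = 'category_id' AND f.facet_value = '24'
WHERE i.facet_name = 'type';
```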
We implemented the same method in a PostgreSQL extension called pgfaceting, which uses roaring bitmaps via a PostgreSQL wrapper extension. With this extension, you pre-define which facets need a precalculated index:
```sql
CREATE EXTENSION pgfaceting;
SELECT faceting.add_faceting_to_table(
    'documents',
    key => 'id',
    facets => array[
        faceting.datetrunc_facet('created', 'month'),
        faceting.datetrunc_facet('finished', 'month'),
        faceting.plain_facet('category_id'),
        faceting.plain_facet('type'),
        faceting.bucket_facet('size',
            buckets => array[0,1000,5000,10000,50000,100000,500000])
    ]
);
```
Behind the curtain

This creates a “user space” inverted index table containing a list of matching documents for each facet value. (Composite facets are not yet supported, but we plan to add them soon.) The index size can be pretty reasonable: the demo dataset we used has 100M documents, and the inverted index ends up at about 1% of the table size.
What about query performance?
Because this is an index in its own right, you can actually use it to quickly determine the result set of the base query. Thanks to this, even when you are selecting 60M rows, you can count up the results in 155 ms. And this is without any parallelism yet – just one core.
```sql
SELECT facet_name, count(distinct facet_value), sum(cardinality)
FROM faceting.count_results('documents'::regclass,
        filters => array[row('category_id', 24)]::faceting.facet_filter[])
GROUP BY 1;

 facet_name | count |   sum
------------+-------+----------
 created    |   154 | 60812252
 finished   |   154 | 60812252
 size       |     7 | 60812252
 type       |     8 | 60812252
(4 rows)

Time: 155.228 ms
```
In future blog posts, you’ll get to see how this great performance is achieved, how hierarchical data can be handled, how the parallel index build works, how to enable incremental maintenance – and much more.
Since the pgfaceting extension has only just been released, the query API is a stand-in for a more thought-through implementation; for the first couple of releases, expect some API changes, the need to rebuild between upgrades, and other rough edges. Please do take a look and leave feedback on how you would want to use it, to help us shape the API. And, as always with open source, hands-on help is also appreciated.