
Faceting large result sets in PostgreSQL

12.2022

While the term faceting may sound foreign to you, you have almost certainly run into it in your adventures online. It is those helpful little boxes that turn up when browsing in web shops or searching in catalogs, telling you how to further narrow down the search results (for example, by color) and how many items will be left. They are immensely helpful for users - so naturally you'll want to implement them in your application. This article will look into how to do that.

 

Conceptually, faceting has two steps. First, take a universe of things and narrow it down to the current search result; let's call this the base query. Second, extract from the matched set of items the attributes and values that could be used for further narrowing down - also known as the facets.

Let's start with a schema of a document database.
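A minimal sketch of such a schema might look like this (all names here are illustrative and are reused in the examples below):

CREATE TABLE category (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE documents (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    type        text NOT NULL,                        -- e.g. 'application/pdf'
    category_id bigint NOT NULL REFERENCES category (id),
    created     timestamptz NOT NULL,                 -- start timestamp
    finished    timestamptz,                          -- end timestamp
    size        bigint NOT NULL,                      -- size in bytes
    title       text,
    tags        text[]                                -- zero or more tags per document
);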

And let's say you want to see distribution for:

  • Type
  • Category
  • Start and end timestamps
  • Size
  • Tags

Here, there are a few different kinds of facets. The simplest example is type: you can just use each distinct type as a facet and count up the occurrences in your result set, for example type=application/pdf (1234). Another simple one is category_id, where you can also directly count up the number of matching rows for each value the column can take. The difference from type is that you want to look up the name of the category for display purposes. Categories could also be hierarchical, but for now let's just brush this aside as a "visualization concern". We'll talk about other things you could do later.

What counts as a facet? Categorical and continuous variables

What the type and category_id facets have in common is that they are what are called categorical variables - they can take on a limited, and relatively small, set of values. You don't have to do anything with them to be able to count them up as facets. The other attributes you are considering are continuous variables: almost every row will have a unique value, but some values are close to each other and some are far apart. Creating one unique facet for each row is not particularly useful to users.

Bucketing for continuous variables

The typical way to handle continuous variables is to use bucketing. You divide the whole value space up into larger chunks and consider those chunks as a categorical variable. For example, for timestamps you could consider all entries from a single month or year to be the same, or for size you could just arbitrarily pick some ranges for small, medium and large documents.
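As a sketch against the illustrative schema above, date_trunc() covers the timestamp case and width_bucket() covers size (the bucket boundaries here are arbitrary):

-- Timestamps: collapse each value to its month.
SELECT date_trunc('month', created) AS created_month, count(*)
  FROM documents
 GROUP BY 1;

-- Size: pick arbitrary boundaries for small/medium/large documents.
-- width_bucket() returns 0 below the first threshold, 1 between the
-- thresholds, and 2 above the last one.
SELECT width_bucket(size, array[100000, 10000000]::bigint[]) AS size_class,
       count(*)
  FROM documents
 GROUP BY 1;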

Composite variables

Finally, tags is what could be called a composite variable, because each result can be split up into the multiple different facet values it contains. Conceptually, these are not too different from categorical variables, the major difference being that the counts of the different categories do not add up to exactly the number of result rows. The inverse is also an important consideration: selecting one tag does not exclude selecting others.

Calculate facets

A simple way to calculate facets is to just run your search query and then group by the facet value and count the results:
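For example, for type (the WHERE clause here is a stand-in for whatever your base query filters on):

-- Facet counts for type over the current result set.
SELECT type AS facet_value, count(*) AS cardinality
  FROM documents
 WHERE created >= '2022-01-01'   -- stand-in base query conditions
 GROUP BY type
 ORDER BY count(*) DESC;

You then repeat the same query once per facet attribute, swapping out the grouping expression.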

For tags you need a slightly more complicated query to unwrap the values, but the general shape is similar. The tags could also be stored in an association table for a more traditional data model, but that wouldn’t change what you are doing here much.
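A sketch of the tags variant, assuming tags is stored as a text[] column as in the schema above:

-- Unwrap the tags array so that every tag becomes its own row, then count.
SELECT tag AS facet_value, count(*) AS cardinality
  FROM documents, unnest(tags) AS tag
 WHERE created >= '2022-01-01'   -- same stand-in base query
 GROUP BY tag
 ORDER BY count(*) DESC;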

Downsides

This approach has a few downsides. The first is that you have to re-execute the base query for each of the attributes that you want to facet on. There's also a lot of repetition going on, and it's hard to build generic client-side handling for the results. Unfortunately, there is no way to get multiple result sets out of a single query in PostgreSQL.

Another method

If you want to get all the facets in a single result set, you have to make some concessions. The SQL type system does not support variable data types, so you have to cast everything to text. But if you accept this, you can transform the query into a form where a lateral subquery returns all facets present in each matched row:
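A sketch of that transformation, reusing the illustrative columns from above:

-- One pass over the result set; each matched row fans out into one
-- (facet_name, facet_value) pair per facet, everything cast to text.
SELECT facet_name, facet_value, count(*) AS cardinality
  FROM documents,
       LATERAL (VALUES
           ('type',          type::text),
           ('category_id',   category_id::text),
           ('created_month', date_trunc('month', created)::text),
           ('size_class',    width_bucket(size, array[100000, 10000000]::bigint[])::text)
       ) AS facets(facet_name, facet_value)
 WHERE created >= '2022-01-01'   -- stand-in base query conditions
 GROUP BY facet_name, facet_value
 ORDER BY facet_name, count(*) DESC;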

Here you are running your base query the same as always, but then for each row you create multiple intermediary rows of the form (facet_name, facet_value). Then you just group by that pair and count up how many of each pair you found.

Faceting small vs big result set sizes

This approach works reasonably well at smaller result set sizes. You could optimize a bit by not wasting time counting facets that are already filtered down to a single value, if you are willing to dynamically generate the query. However, when you get up into the tens to hundreds of thousands of rows, the response times start to become noticeable. At around half a million matching results, execution times go above 1 second, even if lots of parallel workers are allowed. If you have hundreds of millions of documents and are matching a large fraction of them, the execution time can easily go into minutes.

Magic performance?

So what happens if you have more things you want to count up? Are the big companies using black magic not available to mere mortals? Of course not. The trick to calculating facet counts quickly is to arrange the data in a way that makes it really fast to intersect lists of rows matching some conditions and to calculate the size of the resulting set. This is done using inverted indexes that store a list of matching documents for each facet value.

We implemented the same method in a PostgreSQL extension called pgfaceting, which uses roaring bitmaps via a PostgreSQL wrapper extension. With this extension, you need to pre-define which facets should have a precalculated index:
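The setup might look like this; the facet constructor names follow the pgfaceting README, but treat this as a sketch, since the extension is young and the API may still change:

-- pg_roaringbitmap provides the roaringbitmap type pgfaceting builds on.
CREATE EXTENSION roaringbitmap;
CREATE EXTENSION pgfaceting;

-- Declare which facets to precalculate for the documents table.
-- Bucket boundaries are illustrative.
SELECT faceting.add_faceting_to_table(
    'documents',
    key => 'id',
    facets => array[
        faceting.plain_facet('type'),
        faceting.plain_facet('category_id'),
        faceting.datetrunc_facet('created', 'month'),
        faceting.datetrunc_facet('finished', 'month'),
        faceting.bucket_facet('size', buckets => array[0, 1000, 5000, 10000, 50000, 100000, 500000])
    ]
);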

Behind the curtain

We create a "user space" inverted index table containing a list of matching documents for each facet value. (Composite facets are not yet supported, but we plan to add them soon.) The index size is pretty reasonable: the demo dataset we used has 100M documents, and the inverted index ends up being 1% of the table size.

What about query performance?

Because this is an index in its own right, you can actually use it to quickly determine the result set of the base query. Thanks to this, even when you are selecting 60M rows, you are able to count up the results in 155ms. And this is not using any parallelism yet, just one core.
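To illustrate where the speed comes from, here is a sketch of the kind of bitmap arithmetic involved, using pg_roaringbitmap functions directly against a hypothetical inverted index table. The table and column names here are assumptions for illustration; pgfaceting's actual table layout and query API differ:

-- Hypothetical inverted index: one roaring bitmap of document ids per
-- facet value: facet_index(facet_name text, facet_value text, postings roaringbitmap)
WITH base AS (
    -- Bitmap of the base query's result set, here a single facet filter.
    SELECT postings
      FROM facet_index
     WHERE facet_name = 'type' AND facet_value = 'application/pdf'
)
SELECT f.facet_name,
       f.facet_value,
       -- Intersect each facet value's posting list with the base result set
       -- and take the size of the intersection.
       rb_and_cardinality(f.postings, base.postings) AS cardinality
  FROM facet_index f, base
 ORDER BY f.facet_name, cardinality DESC;

Intersecting two compressed bitmaps and taking the cardinality never touches the table itself, which is why the counts stay fast even for tens of millions of matching rows.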

What's next?

In future blog posts, you'll get to see how to achieve this great performance for yourself, how hierarchical data can be handled, how a parallel index build is done, how to enable incremental maintenance - and much more.

Since the pgfaceting extension has just been released, the query API is a stand-in for a more thought-through implementation; for the first couple of releases, expect some API changes, the need to rebuild between upgrades, and other rough edges. Please do take a look and leave feedback on how you would want to use it, to help us shape the API. And as always with open source, hands-on help is also appreciated.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
