Christmas is approaching fast so I thought I’d give PostgreSQL users some inspiration when it comes to buying their Christmas presents. I assume that I am not the only one who has constant troubles finding the right present for somebody. The trouble is that prices are usually strange numbers. Maybe EUR 49.95 or EUR 49.99. This makes it very hard to look for something that costs, say, EUR 50.

Let us assume you want to buy something that costs around EUR 50. Of course, if there is nothing for EUR 49.95 or 48.99 – but given the fact that it is Christmas – you are also fine with EUR 54.99.

**KNN comes to the rescue**

Using the = operator is clearly not enough to fixing your problem. What you really want is some kind of fuzzy search. The trivial approach to this database problem is to just look for a range of values. In many cases this seems like a good idea, but remember: If you happen to be a large website, looking for everything between EUR 40 and EUR 60 might yield hundreds of thousands of products. This is clearly not an option. All you want is a handful of suggestions, which are as close as possible to your desired target price.

Finding something close to what you are looking for is exactly what KNN-searc (K Nearest Neighbor Search) has been made for. It allows you to break the chains of the = operator and allows fuzzy search.

**An example**

To show you how you can use PostgreSQL to search for similar numbers, we can create a table containing some randomly generated entries:

test=# CREATE TABLE t_random AS SELECT (random() * 100) AS r FROM generate_series(1, 10000); SELECT 10000 test=# SELECT * FROM t_random LIMIT 10; r ------------------ 58.10591513291 22.2964704036713 89.8202452808619 54.2416563257575 81.23788847588 70.1738043688238 7.34952366910875 41.887197131291 13.8173459097743 7.20933876000345 (10 rows)

The generate_series function is a nice and easy way to generate a list of values. As you can see, we have evenly distributed values between 0 and 100 – exactly what we need for our test.

**The trivial approach: Ranges**

As mentioned before, most people would go for the trivial approach here and just come up with a range. To do so we first create an index:

test=# CREATE INDEX idx_r ON t_random (r); CREATE INDEX

Then we can just do the query:

test=# explain SELECT * FROM t_random WHERE r BETWEEN 45 AND 55; QUERY PLAN ----------------------------------------------------------------------------------- Bitmap Heap Scan on t_random (cost=21.99..81.24 rows=950 width=8) Recheck Cond: ((r >= 45::double precision) AND (r <= 55::double precision)) -> Bitmap Index Scan on idx_r (cost=0.00..21.75 rows=950 width=0) Index Cond: ((r >= 45::double precision) AND (r <= 55::double precision)) (4 rows)

Just take a look at the estimates of the PostgreSQL optimizer: We are expected to need 950 rows for this query. This is not surprising as it is close to 10% of the data. Just imagine doing that on 1 mio rows – you had to read 100.000 rows to display just a handful of them. Clearly – performance will go down the drain in this case.

A second problem is: What if there are no products in the range of 45 – 55? What if the cheapest product is EUR 56?

**Supersonic fast KNN search**

KNN can come to the rescue here.

To use KNN we first have to install the extension containing some fancy operator classes:

test=# CREATE EXTENSION btree_gist; CREATE EXTENSION

Then we can deploy a Gist index capable of doing KNN:

test=# CREATE INDEX idx_knn ON t_random USING gist(r); CREATE INDEX

The module will introduce the so called “distance operator” (<->). It measures the distance between the value in doubt and the target we are looking for. The beauty of the PostgreSQL KNN mechanism is that we can sort by this distance to fetch those rows in the order we desire:

test=# explain analyze SELECT *, 50 <-> r FROM t_random ORDER BY 50 <-> r LIMIT 10 ; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..0.63 rows=10 width=8) (actual time=0.264..0.284 rows=10 loops=1) -> Index Scan using idx_knn on t_random (cost=0.00..628.25 rows=10000 width=8) (actual time=0.263..0.279 rows=10 loops=1) Order By: (r <-> 50::double precision) Total runtime: 0.367 ms (4 rows)

Our query just takes a fraction of a millisecond. And more important: Runtimes are pretty stable – even if the number of rows grows.

On behalf of my entire team, we wish you all a Merry Christmas time and efficient Christmas shopping 😉