There are certain types of questions which accompany any PostgreSQL consultant throughout their entire career. One of them, invariably, is: “How can we compare the content of two tables?” This is an important thing to know: we might want to ensure that data is identical after a migration, or verify that data has not been corrupted in storage. These questions come up so often that a speedy way to compare table content is essential. The answer lies in creating checksums for tables.

md5sum on Linux

On a Linux system you normally use a program such as md5sum to get the checksum, or “digital fingerprint”, of a file. Here’s how it works:

[hs@fedora ~]$ md5sum /tmp/file.txt
d41d8cd98f00b204e9800998ecf8427e  /tmp/file.txt

md5sum is a widely-used method to obtain checksums. Checksums can be compared more easily than entire files – especially when those files are large.
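For comparison, here is a minimal Python sketch of what md5sum does: stream the file through md5 in chunks. The temporary file is empty, mirroring the example above — d41d8cd98f00b204e9800998ecf8427e is the well-known md5 digest of empty input.

```python
import hashlib
import tempfile

def md5sum(path, chunk_size=65536):
    """Stream a file through md5 in fixed-size chunks, like the md5sum utility."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Create an empty file, like /tmp/file.txt in the example above.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# Empty input always hashes to the same well-known digest.
assert md5sum(path) == "d41d8cd98f00b204e9800998ecf8427e"
```

Chunked reading keeps memory usage constant no matter how large the file is — the same property we will want from a table checksum later on.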

The question which naturally arises is: How can we achieve the same thing in a database? What can we do in SQL to create such a checksum?

Ordered sets in SQL

The solution to the problem in SQL is to use an “ordered set”. Now what is that? Here is an example of a simple ordered set:

test=# SELECT id % 2 AS grp,
              percentile_disc(0.5) WITHIN GROUP (ORDER BY id)
       FROM generate_series(1, 43243) AS id
       GROUP BY 1;
 grp | percentile_disc
-----+-----------------
   0 |           21622
   1 |           21621
(2 rows)

This is the SQL way to calculate a median. Here is what happens: generate_series creates the values from 1 to 43243 (one row per value), and the GROUP BY clause splits them into two groups — one containing the odd numbers, the other the even numbers. The trick is: if we calculate the average (mean), the order of the data does not matter. Suppose we have 1, 2, 3, 4, 5: the average is always 3, regardless of order. A median, however, works differently: we sort the data and then go 50% of the way into the ordered set — which is exactly what percentile_disc(0.5) does. Finding the “middle” value does not work without order.
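The difference is easy to demonstrate outside of SQL as well. A small Python sketch (the sample data is made up; the lower-middle index mirrors how percentile_disc(0.5) picks a discrete value):

```python
import statistics

data = [4, 1, 5, 2, 3]

# The mean ignores order: any permutation of the values gives the same result.
assert statistics.mean(data) == statistics.mean(sorted(data)) == 3

# A discrete median needs the data sorted first: sort, then pick the
# middle element (the lower middle for even-sized sets, as
# percentile_disc(0.5) does).
ordered = sorted(data)
median = ordered[(len(ordered) - 1) // 2]
assert median == 3
```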

The same is true for checksums. If we want to create a fingerprint for a table, the order of entries does matter. Identical data in a different order will lead to different results – and we want to avoid that.

Checksums in PostgreSQL

To solve this problem, we implement an ordered set which consumes sorted data and produces a checksum.

Before we implement the ordered set we need to see what’s already available in PostgreSQL:

test=# SELECT md5('hans');
               md5
----------------------------------
 ...
(1 row)

A simple md5 function is available, but there’s no corresponding aggregate. We need to define one. Here’s how it works:

CREATE FUNCTION md5_agg_sfunc(text, anyelement)
       RETURNS text
       LANGUAGE sql
AS $$
       SELECT md5($1 || $2::text)
$$;

CREATE AGGREGATE md5_agg (ORDER BY anyelement)
(
       STYPE = text,
       SFUNC = md5_agg_sfunc,
       INITCOND = ''
);

An “ordered set” needs an INITCOND, which we initialize to an empty string. The general idea is: we take the current state string, append the next row, and turn the result into a checksum; then we append the next row, checksum again, and so on, as long as there is data. It is essentially a checksum over checksums, which guarantees that we only ever keep a single digest in memory, no matter how much data we process. For every row, the aggregate calls SFUNC (in our case, md5_agg_sfunc).
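To make the mechanism concrete, here is a minimal Python sketch of the same state-transition loop (the row strings are made up for illustration; hashlib provides md5):

```python
import hashlib
from functools import reduce

def md5_agg_sfunc(state, row):
    """Python analogue of the SQL transition function:
    fold the next row into the running checksum."""
    return hashlib.md5((state + row).encode()).hexdigest()

rows = ["(1,11)", "(2,12)", "(3,13)"]  # hypothetical t_test rows

# INITCOND is the empty string; SFUNC is applied once per (sorted) row.
checksum = reduce(md5_agg_sfunc, sorted(rows), "")

# Only one 32-character digest is ever kept in memory,
# no matter how large the table is.
assert len(checksum) == 32

# The result depends on row order -- hence WITHIN GROUP (ORDER BY ...).
assert checksum != reduce(md5_agg_sfunc, reversed(sorted(rows)), "")
```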

In the context of an ordered set, one thing is of great importance: You might have guessed it – it is order. We use a little trick here:

test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE

test=# INSERT INTO t_test
       SELECT   x, x + 10
       FROM     generate_series(1, 1000) AS x;
INSERT 0 1000

After loading some data, let's see what's possible:

test=# SELECT * FROM t_test ORDER BY t_test LIMIT 3;
 a | b
---+----
 1 | 11
 2 | 12
 3 | 13
(3 rows)

Note that we’re actually sorting by the “table”. What that means is that we’re sorting by every column (so in our case, ORDER BY a, b). This is important, as you will see in a moment.

It is also possible to use an entire row just like a column:

test=# SELECT t_test FROM t_test ORDER BY t_test LIMIT 3;
 t_test
--------
 (1,11)
 (2,12)
 (3,13)
(3 rows)

That is convenient. Let's recap the definition of the aggregate:

CREATE  AGGREGATE md5_agg (ORDER BY anyelement)...

What we see here is that the aggregate can basically order by “anything”. So, as in this case, we fully exploit that. As long as a row can be sorted, we can use it for our checksums:

test=# SELECT md5_agg() WITHIN GROUP (ORDER BY t_test) FROM t_test;
             md5_agg
----------------------------------
 ...
(1 row)

We’ve successfully created a checksum for this table. Of course, we can also checksum subsets of data:

test=# SELECT 
  a % 2 AS grp,
  md5_agg() WITHIN GROUP (ORDER BY t_test)
FROM t_test
GROUP BY a % 2;

 grp |             md5_agg
-----+----------------------------------
   0 | 7e310dd10088ae69ab8d8a2b66e36aa2
   1 | 62f385154b241e74be8c91de150550c6
(2 rows)

We can always use a GROUP BY clause to split the data into groups and compare checksums group by group.
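A quick Python sketch of the same per-group idea (the rows mimic a tiny version of the t_test table; the grouping and chained md5 are assumptions chosen to mirror the SQL above):

```python
import hashlib
from collections import defaultdict

def chain_md5(rows):
    """Chained md5 over the given rows: state = md5(state + row)."""
    state = ""
    for row in rows:
        state = hashlib.md5((state + row).encode()).hexdigest()
    return state

# Rows of a tiny t_test-like table as (a, b) tuples.
rows = [(x, x + 10) for x in range(1, 11)]

# GROUP BY a % 2: split into even/odd groups, then checksum each group
# over its sorted row representations -- the Python analogue of
#   md5_agg() WITHIN GROUP (ORDER BY t_test) ... GROUP BY a % 2
groups = defaultdict(list)
for a, b in rows:
    groups[a % 2].append(f"({a},{b})")

checksums = {grp: chain_md5(sorted(members)) for grp, members in groups.items()}

assert set(checksums) == {0, 1}
assert checksums[0] != checksums[1]  # different data, different fingerprints
```

Comparing per-group checksums on both sides of a migration narrows a mismatch down to the affected group instead of just flagging the whole table.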


Creating checksums can be incredibly useful to help ensure the quality of your data. However, there is more. If you really want to protect your data and ensure maximum security, we suggest taking a look at PostgreSQL TDE which offers a way to encrypt data on disk. Check out our blog post about managing encryption keys with PostgreSQL TDE to read more.