Row store vs. column store – a lot has been written about this topic in the context of PostgreSQL and Citus. What does it really mean and what are the implications? Are column stores “always cool” and “always beneficial”? No, there’s more to it – which requires a closer look. When trying to understand the benefits of column store over row store or vice versa, it’s important to grasp the basic ideas which lead to the database behavior we’re going to see in this blog.

What does “row store” vs. “column store” mean?

Let’s use a practical example to understand. Consider the following:

SELECT first_name, last_name FROM person WHERE id = 10;

What will happen behind the scenes: PostgreSQL will not just look for 3 columns (first_name, last_name and id). It will instead communicate with the disk system using 8k blocks. Inside a single block, PostgreSQL stores more than just one row including visibility and so on. In a classical OLTP application (bookkeeping, financial transactions, address search, etc.) we usually need the entire row. Just imagine you are searching for an address – usually all fields are displayed and it is highly beneficial to fetch the entire information from disk in one go. That is where a row store is the best option.

However, what about analytical use cases? Row store or column store?

Suppose we want to sum up our entire sales for the past 20 years. Do we need the name of those products we sold? Do we need the quantity and the place where we sold it? Actually, we do not. All we need is 1 out of, say, 20 columns. If we want to add up 100 billion rows it makes sense to just read information worth 1 column as opposed to the entire data. Reading 20 columns worth of data to throw 19 columns away because they are not needed for our purpose leads to a lot of I/O and adds a ton of overhead (such as finding the column we need in the row and so on). Analytics and BI are the use cases where column stores shine.

However, suppose we store every column of a table in a single data file and we want to find all columns belonging to a single address? In this case we have to read 20 data files and recompile the row in the most cumbersome way possible. This is when row stores shine.

As you can see, there is no technique that “fits all”. Depending on the use case, different storage formats are required.

Comparing row stores and Citus column stores

We can easily compare row stores and column stores: create two identical tables in different storage formats. The important difference is simply the storage method (USING columnar):

postgres=# CREATE TABLE t_row_timeseries (
id 		serial, 
tstamp 	timestamptz, 
val 		float
);
CREATE TABLE
postgres=# CREATE TABLE t_col_timeseries (
id 		serial, 
tstamp 	timestamptz, 
val 		float
) USING columnar;
CREATE TABLE

Then we distribute the table as usual:

postgres=# SELECT create_distributed_table(
't_row_timeseries', 
'tstamp'
);
 create_distributed_table 
--------------------------
 
(1 row)

postgres=# SELECT create_distributed_table(
't_col_timeseries', 
'tstamp'
);
 create_distributed_table 
--------------------------
 
(1 row)

Now we want to load some timeseries data.

For the purpose of this test, we will load 157 million rows. This is equal to one row per second for five years: 86400 * 365 * 5 = 157.680.000. In other words, this is already a relevant but realistic amount of data.

postgres=# INSERT INTO t_row_timeseries (tstamp, val) 
SELECT '2020-01-01'::timestamp + (id || ' seconds')::interval,
 	  id % 16 
FROM 	  generate_series(1, 86400*365*5) AS id;
INSERT 0 157680000
Time: 104624.127 ms (01:44.624)


postgres=# INSERT INTO t_col_timeseries (tstamp, val) 
SELECT '2020-01-01'::timestamp + (id || ' seconds')::interval, 
  id % 16 
FROM 	  generate_series(1, 86400*365*5) AS id;
INSERT 0 157680000
Time: 105505.957 ms (01:45.506)

What is interesting here is that the time we need to load the data is pretty similar. Most of the CPU cycles are burned to generate the data in the first place. The rest is not so important, so loading data is not really an issue. We generate and load around 1.5 million rows per second on my local Mac Mini (M1) on OS (which is not the fastest OS for database work at all).

When we inspect the Citus system table, we instantly see the major difference in size between these two storage methods:

postgres=# SELECT table_name, table_size, access_method 
FROM 	  citus_tables ;
    table_name    | table_size | access_method 
------------------+------------+---------------
 t_col_timeseries | 1651 MB    | columnar
 t_row_timeseries | 7849 MB    | heap
(2 rows)

The column store is around 5 times smaller than the normal PostgreSQL table, which is pretty much what we expected anyway. A lot of trickery which pays off big time can be applied with this storage technique. While hard-drives have become bigger, cheaper and faster, the effect of this technological progress has been eaten away financially, by the cost of cloud storage, and technically, by virtualization and network storage. In short: It does matter that our table is smaller and more compact, for many reasons.

On the performance side of things, we can see a major improvement over classical row storage:

postgres=# SELECT count(*) FROM t_row_timeseries;
   count   
-----------
 157680000
(1 row)
Time: 3028.839 ms (00:03.029)

postgres=# SELECT count(*) FROM t_col_timeseries;
   count   
-----------
 157680000
(1 row)
Time: 979.112 ms

The query is 3 times faster than before.

In reality, even more performance improvement is possible, but as always, things depend a bit on the type of query you are running.
Running slightly more advanced SQL

However, there are many more options than just a simple count.

What if we want to do a little grouping? As expected, we will again see major performance gains:

postgres=# SELECT val, count(*) 
FROM 	  t_row_timeseries 
GROUP BY 1 
ORDER BY 1 
LIMIT 3;
 val |  count  
-----+---------
   0 | 9855000
   1 | 9855000
   2 | 9855000
(3 rows)
Time: 3764.688 ms (00:03.765)

postgres=# SELECT val, count(*) 
FROM 	  t_col_timeseries 
GROUP BY 1 
ORDER BY 1 
LIMIT 3;
 val |  count  
-----+---------
   0 | 9855000
   1 | 9855000
   2 | 9855000
(3 rows)
Time: 2519.854 ms (00:02.520)

The difference is not as big as before, because a lot more time is burned on CPU. I/O time is lower compared to the overall time needed by the CPU to process this information. The simple rule is: The more CPU-intensive things are, the less relevant storage size becomes on the performance side.

There is something else you need to keep in mind. Consider an even fancier query:

postgres=# SELECT val, count(*) 
FROM 	  t_row_timeseries 
GROUP BY ROLLUP (val) 
ORDER BY 1;
ERROR:  could not run distributed query with GROUPING SETS, CUBE, or ROLLUP
HINT:  Consider using an equality filter on the distributed table's partition column.
Time: 2.225 ms

At the time of publishing this blog, not every operation is possible.

Some operations are only possible on local storage and are hard to achieve in a distributed environment. This has to be kept in mind when making design decisions. Don’t expect everything to work just like before by just plugging in more systems – that can backfire, because inevitably, features will be lost.

Finally …

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.