
For many years, MySQL and PostgreSQL were competing databases that addressed slightly different audiences. In my judgement (which is of course a bit biased), PostgreSQL always served the professional users, while MySQL had a strong standing among web developers and the like. But after Oracle took over MySQL, I had the feeling that the star of MySQL was somewhat fading.

Since the introduction of MariaDB, I have the personal feeling that this story is not over yet: people are still asking for MySQL / MariaDB and are interested in comparing PostgreSQL to those technologies. Many people are not happy with statements such as "PostgreSQL is just better". Arguments like that are shallow, and we (the PostgreSQL community) can really do a lot better. So I thought I'd bring up some concrete reasons why it is actually more desirable to use PostgreSQL instead of MySQL / MariaDB.

The idea of this post is to collect some examples that show, on a technical level, why you might want to go for PostgreSQL. Keep in mind: this post is simply a "best of" and by far not a complete list. However, it should be enough to convince people and to help decision makers.

PostgreSQL vs. MySQL / MariaDB: Data type handling

MySQL / MariaDB:

MariaDB [test]> CREATE TABLE data (
    id    integer NOT NULL,
    data  numeric(4, 2)
);
Query OK, 0 rows affected (0.02 sec)

MariaDB [test]> INSERT INTO data VALUES (1, 1234.5678);
Query OK, 1 row affected, 1 warning (0.01 sec)

MariaDB [test]> SELECT * FROM data;
+----+-------+
| id | data  |
+----+-------+
|  1 | 99.99 |
+----+-------+
1 row in set (0.00 sec)

PostgreSQL (any version):

test=# CREATE TABLE data (
    id    integer NOT NULL,
    data  numeric(4, 2)
);
CREATE TABLE

test=# INSERT INTO data VALUES (1, 1234.5678);
ERROR:  numeric field overflow
DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2.

What you see here is that PostgreSQL errors out, because the data you are trying to put into the table does not conform to the data type of the column, while MySQL / MariaDB silently change your data. Yes, MySQL / MariaDB do issue a warning, but at the end of the day the data in the table is not what it is supposed to be.

Why does that matter? Suppose you have won the lottery. Instead of one million you just end up with 99.99, because the field was not large enough, and unfortunately nobody noticed. Would you be angry? I would. The behavior shown by PostgreSQL in this example is the correct one:

If you shoot yourself in the head, there is no point in getting a warning that "you might be dead now" afterwards - it is better to prevent the thing from happening in the first place.

Lesson learned: PostgreSQL takes your data structures seriously.

PostgreSQL vs. MySQL / MariaDB: Taking existing data seriously

MySQL / MariaDB:

MariaDB [test]> ALTER TABLE data MODIFY data numeric(3, 2);
Query OK, 1 row affected, 1 warning (0.06 sec)
Records: 1  Duplicates: 0  Warnings: 1

MariaDB [test]> SELECT * FROM data;
+----+------+
| id | data |
+----+------+
|  1 | 9.99 |
+----+------+

PostgreSQL (any version):

test=# INSERT INTO data VALUES (1, 34.5678);
INSERT 0 1

test=# SELECT * FROM data;
 id | data
----+-------
  1 | 34.57
(1 row)

test=# ALTER TABLE data ALTER COLUMN data TYPE numeric(3, 2);
ERROR:  numeric field overflow
DETAIL:  A field with precision 3, scale 2 must round to an absolute value less than 10^1.

Again, MySQL / MariaDB actually change your data. Remember, the original value inserted into the table was 1234.5678; now we have ended up with 9.99. Do you notice the "minor difference"? What if this were your bank account? Would you care? If this database were powering the machine keeping a sick child alive - would you still trust it, or would you demand a replacement? If your life depends on data, would you accept a silent change in the content of a table? Maybe I am too conservative, but I would not.

PostgreSQL does what I personally expect from a database: it errors out and informs us that the change is not possible, because there is existing data that has to be protected. If you want the data in the table to be changed even though it violates the new rules, you have to tell PostgreSQL explicitly how to do it:

PostgreSQL:

test=# ALTER TABLE data
           ALTER COLUMN data TYPE numeric(3, 2)
           USING (data / 10);
ALTER TABLE

In this case we have told PostgreSQL exactly how the existing data has to be transformed. PostgreSQL does not try to be smart (and fail, as MySQL / MariaDB do), it does not try to do something fancy - it simply does what YOU want and what is good for your data.

PostgreSQL vs. MySQL / MariaDB: NULL handling does matter

MySQL / MariaDB:

MariaDB [test]> UPDATE data SET id = NULL WHERE id = 1;
Query OK, 1 row affected, 1 warning (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 1

MariaDB [test]> SELECT * FROM data;
+----+------+
| id | data |
+----+------+
|  0 | 9.99 |
+----+------+
1 row in set (0.00 sec)

PostgreSQL (any version):

test=# UPDATE data SET id = NULL WHERE id = 1;
ERROR:  null value in column "id" violates not-null constraint

Do you remember that there is a NOT NULL constraint on the "id" column? MySQL / MariaDB don't care and simply set the value to 0. The point is: 0 and NULL are definitely not the same thing. NULL means "undefined" or "we don't know the value". This makes a lot of sense in the real world: if you don't know how much money you have in your pocket, it does not mean that your pocket is empty. If you don't know how many beers you had, it does not mean that there is "zero" alcohol in your blood - you might just have fallen off a chair.

PostgreSQL carefully distinguishes between 0, empty string, and NULL. Those things are simply not the same and PostgreSQL will always respect your constraints.

Deciding on a database engine

For me there is no question whether to use MySQL / MariaDB or PostgreSQL. The arguments outlined in this post are strong enough evidence for me personally that MySQL is not an option. However, this is a free world and people can make decisions for themselves - and will hopefully decide wisely.

Some weeks ago at pgDay Paris, during the evening social event, we got into a small "guesstimation" with another Postgres enthusiast about the perceived speed of integer-based IDs vs. UUID-based ones for larger amounts of data. In short, he reasoned that one should generally avoid UUIDs for performance reasons, as they take 16 bytes instead of the 4/8 bytes of a normal integer and are thus slower. While it is true that they occupy some more bytes on disk (and are a bit rare in the wild, as simpler applications just don't need them), I was still sceptical about a direct translation of those extra bytes into a speed difference. More specifically, we talked about bigger joins (full table scans), as on indexed single-row operations there shouldn't be any difference anyway. In the end it was hard to say anything meaningful, as I was just guessing - but now I have finally got around to testing it with a simple setup and am laying out the numbers for you to judge. In addition to UUID, I also included int8 and numeric for completeness.

Test setup

I created 4 pairs of tables with 5 million rows each, with the same IDs in both tables of a pair, so that the join would be a "complete" one. Then I had to make a difficult choice - as the discussion arose from real-life situations, it would be pretty meaningless to have tables consisting of the ID column only. Thus, to make the test a bit more realistic, I decided to add 6 other columns (2 ints, 2 timestamps, 2 short texts), so that the extra bytes of the ID column wouldn't have such a huge effect on the outcome. Then I launched a new test cluster with the latest PostgreSQL version 9.6.2, set shared_buffers=2GB, work_mem=1GB and maintenance_work_mem=1GB so that the join would be performed in memory (see the configuration sketch below), and also enabled the pg_stat_statements extension. Then I left the test script to run overnight on my laptop (64-bit Ubuntu 14.04, 16 GB RAM, i5-6200U @ 2.30 GHz, SSD), doing
1) 1000 join iterations for every data type and
2) 100 index building iterations.
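For reference, a minimal sketch of the relevant configuration, based only on the settings listed above (loading pg_stat_statements via shared_preload_libraries is the usual way to enable the extension):

# postgresql.conf of the test cluster
shared_buffers = 2GB
work_mem = 1GB
maintenance_work_mem = 1GB
shared_preload_libraries = 'pg_stat_statements'

-- and once in the test database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;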

The scripts can be found here if you want to try it out yourself. Note that I'm using UNLOGGED tables (available since PostgreSQL 9.1) to reduce the IO dependence of the test (and to speed up the inserts).

The tables ended up being around 364-403 MB in size and looked like this in the end:
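A minimal sketch of what one such pair could have looked like, here for the uuid case (the table and column names are my assumptions; gen_random_uuid() comes from the pgcrypto extension on 9.6):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE UNLOGGED TABLE test_uuid_a (
    id    uuid NOT NULL,
    int1  int,
    int2  int,
    ts1   timestamptz,
    ts2   timestamptz,
    txt1  text,
    txt2  text
);
CREATE UNLOGGED TABLE test_uuid_b (LIKE test_uuid_a);

-- 5 million rows with identical ids in both tables
INSERT INTO test_uuid_a
SELECT gen_random_uuid(), 1, 2, now(), now(), 'some text', 'some more text'
FROM generate_series(1, 5000000);

INSERT INTO test_uuid_b SELECT * FROM test_uuid_a;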

And the test queries looked like this:
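A sketch of the kind of queries measured (again using the assumed table names from above):

-- join timing, repeated 1000 times per data type
SELECT count(*)
FROM   test_uuid_a a
JOIN   test_uuid_b b USING (id);

-- index build timing, repeated 100 times per data type
CREATE INDEX test_uuid_a_id_idx ON test_uuid_a (id);
DROP INDEX test_uuid_a_id_idx;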

The results

Datatype | JOIN mean time, 5M rows (s) | % diff. to smallest | Index size (MB) | % diff. to smallest
int4     | 2.72                        | -                   | 107             | -
int8     | 2.95                        | +8 %                | 107             | -
numeric  | 3.65                        | +34 %               | 107             | -
uuid     | 3.07                        | +13 %               | 150             | +40 %

Looking at the results table, we see that there is indeed a penalty for the bigger data types, which makes sense. But is this +13 % for uuid compared to int4 little or much? I'd say it is small enough to be dismissed in most cases, given that UUIDs benefit the application architecture when using shards, for example, and also provide "some" security, as guessing or looping over them is not easily possible. The thing that surprised me the most, though, was the performance of numeric: +34 %! One should definitely watch out there and avoid it if possible - i.e. it is a very bad choice for normally incrementing integer IDs.

The conclusion

As this was a purely synthetic test, keep in mind that in real-life scenarios there are more processes fighting for resources, there is locking, more bloat and most probably more columns in the tables, making waits for disk access more relevant - so the real performance loss from processing the extra bytes of the ID column should actually be even smaller.

By the way, while googling I also found a couple of articles on data type speed differences when doing a lot of INSERTs. There the penalty was similarly modest in most cases, so in general there is nothing to be afraid of. In the end I'd still recommend choosing ID column data types based on utility, not performance. And as always - generate a lot of test data for your application-specific scenario and measure carefully before betting on a certain approach.

 



PostgreSQL support cases come in on a regular basis. This week an especially noteworthy one reached our desks here at Cybertec - so interesting that I decided to sit down and share some information. I guess many people out there have similar issues, so this post might be helpful to developers and administrators alike.

Composite data types in PostgreSQL

PostgreSQL offers a very interesting feature: it is possible to create so-called "composite types", which are simply a collection of fields.

Here is an example:
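A minimal sketch of such a type, consisting of three integer values as described below (the field names a, b and c are chosen arbitrarily):

CREATE TYPE sample_type AS (
    a    int,
    b    int,
    c    int
);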

In this example a type called "sample_type" is created. For the sake of simplicity it consists of three integer values. The beauty of PostgreSQL is that composite types can be used just like normal types (e.g. as columns, in arrays, etc.). This offers database developers a great deal of flexibility: passing more complex data to functions is a lot easier than without composite types, which makes them a joy to work with.

However, when using a composite type it is also necessary to be cautious and to understand the basic inner workings of PostgreSQL.

Consider the following function:
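A sketch of a function matching the description that follows - it emits a NOTICE, sleeps for two seconds and returns two rows of sample_type:

CREATE OR REPLACE FUNCTION sample_func()
    RETURNS SETOF sample_type AS
$$
BEGIN
    RAISE NOTICE 'sample_func called';   -- shows up once per actual function call
    PERFORM pg_sleep(2);                 -- stands in for some expensive computation
    RETURN QUERY VALUES (1, 2, 3), (4, 5, 6);   -- the two rows mentioned below
END;
$$ LANGUAGE plpgsql;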

The function returns two rows and sleeps for two seconds. So far things are fine.

However, consider the following trivial SQL statement:
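Something along these lines - a single star-expanded call that, as explained next, runs for roughly 6 seconds and emits the NOTICE three times:

SELECT (sample_func()).*;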

As you can see, the statement takes around 6 seconds. The reason is simple: the code is executed three times. Note that there is just one function call in the statement, but PostgreSQL does something important behind the scenes: it expands the SELECT-clause. To PostgreSQL, "(sample_func()).*" is not just a single call - it is expanded before execution in a way that the function is called three times (one call for each column). While this is not a big deal for fast, tiny functions, it can be a HUGE issue for more complicated or slow code such as the function I have just written.

Reducing the number of function calls

To reduce the number of function calls the function can be moved to the FROM-clause of the query:
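A sketch of the rewritten query - the function is referenced exactly once in the FROM-clause and therefore evaluated only once:

SELECT * FROM sample_func();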

The execution time has dropped by 2/3 and the code only emits one NOTICE.

Additional performance tips

Sometimes it is hard to simply move the composite type to the FROM-clause. In those cases it can make sense to work with subselects to get around the problem, and with NOTICE messages for debugging to verify how often the function is really called. Usually trying a couple of variations of the query leads to success.
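One such subselect-based workaround (a sketch - verify the call count with the NOTICE messages on your own version) evaluates the function once in a subquery and only expands the resulting composite value afterwards; the OFFSET 0 keeps the planner from flattening the subquery:

SELECT (f).*
FROM (
    SELECT sample_func() AS f
    OFFSET 0        -- prevents the subquery from being pulled up
) AS sub;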

If you have any questions regarding this issue, feel free to contact us and to leave a comment in the feedback section.

 



After some time of absence due to a high workload, I finally got around to writing more about an issue which has been on my mind for quite a while: removing duplicate rows.

It happens from time to time that PostgreSQL support clients accidentally load data twice or somehow manage to duplicate it (human error, bugs in applications, etc.). The main question is: how can the problem be fixed? How can developers handle duplicate data and clean things up? This post tries to answer exactly that question.

Detecting and finding duplicate entries

To demonstrate how duplicates can be found, I have prepared some sample data:
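A minimal sketch of a matching table (the table name t_location and the column name city are my assumptions):

CREATE TABLE t_location (
    city    text
);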

The idea is to store a couple of German cities:
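The data could be loaded like this (a sketch, with Dortmund and Frankfurt deliberately inserted twice):

INSERT INTO t_location (city)
VALUES  ('Berlin'),
        ('Dortmund'),
        ('Dortmund'),
        ('Dresden'),
        ('Frankfurt'),
        ('Frankfurt'),
        ('Munich');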

Obviously some cities show up more than once:
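A simple way to see that is grouping and counting (using the assumed names from above):

SELECT city, count(*)
FROM   t_location
GROUP  BY city
HAVING count(*) > 1;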

“Dortmund” and “Frankfurt” are the ones that have been added more than once and need a proper cleanup.

Identifying rows

Before we can approach the problem head on, I want to tell you about a thing called "ctid", which is used in PostgreSQL to identify a row in a table:
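The ctid can be selected like any other column (a sketch on the assumed table from above):

SELECT ctid, city
FROM   t_location;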

(0, 1) means that we are looking at the first block and the first row within it. (0, 2) is the second row in the first 8k block, and so on. The trouble is: there is no such thing as "DELETE … LIMIT 1" in PostgreSQL. Therefore we have to find a way to use this "hidden" column to solve the problem. The ctid gives us a free identifier, which can help to delete just a single row.

Here is a query which returns the first incarnation of each row:
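A sketch of such a query - for every city the smallest ctid marks the first physical incarnation of that value:

SELECT city, min(ctid) AS first_ctid
FROM   t_location
GROUP  BY city;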

Now that we have identified all the duplicate rows in the table, it is possible to write a DELETE statement which just removes the undesired data:
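A sketch of the cleanup, keeping only the first incarnation of every city and deleting the rest:

DELETE FROM t_location
WHERE  ctid NOT IN (SELECT min(ctid)
                    FROM   t_location
                    GROUP  BY city);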

As you can see, the two rows we don't want anymore are removed from the table. Mind that the ctid is not a static thing: UPDATE statements, VACUUM FULL, etc. can actually change it. Therefore it is highly recommended to perform this kind of operation only if you are the only one on the system and if the table is locked appropriately.

 



Although one should normally try to avoid using non-core extensions/forks of Postgres, in some rare cases it can be even worse not to use them. And as the BDR project is here to stay (as I understand it, the main functionality will gradually be integrated into the Postgres core), the following information might still be useful in the future. The thing is that when we got a client request to support them with custom replication conflict handlers, we discovered to our surprise that the official documentation (and also our big brother, the Internet) is sadly pretty empty in that area, with no working code examples. Maybe even on purpose, as a kind of warning that you're already on slippery ice and should watch out - who knows. In short, after some poking around I got acquainted with the topic, and just to give you an idea of what's involved, here is a short overview.

About BDR in two sentences

First, for those not too familiar with the relatively new BDR (Bi-Directional Replication) project - in short, it enables asynchronously replicated master-master Postgres clusters (up to 48 nodes) where the nodes are fairly independent of each other: they just try to communicate with all other nodes and fetch and apply (synchronize) their data changes on a "best effort" basis. But to have a real chance at keeping data in sync, BDR imposes quite some limitations on the database design - the biggest one perhaps being that you must have UNIQUE or PRIMARY KEY constraints on all of your tables, while exclusion constraints and DDL operations causing full table rewrites are disabled. In practice this shouldn't be a big problem if you follow good database design practices. There are also some other aspects, so for general concepts see here and for DDL restrictions see for example here. And as there are some other corner-case issues (see GitHub), I would recommend doing thorough testing for your concrete use case before going "all in" on BDR. In general it is still a very powerful tool and definitely the future.

BDR conflict handling basics

But back to conflicts. In short - when you model your data according to the BDR guidelines, most of the time the framework will take care of conflict situations where two nodes changed or inserted the same data row simultaneously. This is possible because no global locking is done: all nodes operate on their local data and just try to sync with the others automatically, applying a "last timestamp wins" policy. That should suffice for 99% of use cases. One just needs to remember that this also means that some inserts or updates will simply be thrown away, and the applications need to be designed accordingly. More info on the types of conflicts that BDR can resolve automatically can be found here.

But what to do when we still want to customize our conflict handling? For that the framework offers the possibility to declare standard PL/pgSQL stored procedures with a specific signature that receive the conflicting rows as input; the procedure can then decide whether it wants to discard one of the rows or to merge some column values. The merging use case is probably the most wanted approach for custom handlers, so we'll do a sample with that. Also note that custom handling is only possible where a single UNIQUE constraint is violated (thus you shouldn't have too many of them)!

Setting up a sample handler function

User-written conflict handler stored procedures have table (type) specific inputs, meaning you'll need at least one separate function for every handled table. The handler function does, however, get the conflict type (INSERT vs INSERT, INSERT vs UPDATE, …) as input, so you can use a few IF statements to cover all the necessary conflict types for one table in a single function.

A simple INSERT vs INSERT conflict handler that merges, for example, our website hit counters (imagine we have two data centers, each with a web server and one BDR database node, and some kind of inflexible log monitoring tool that can only do an INSERT at midnight with a fixed primary key ID translating to yesterday's date) will look like this:


After successfully defining it, our conflict resolution handler jumps into action, and we can see corresponding messages in the server logs:

In addition to output in the server log, it is also possible to enable conflict handler "audit logging" into the dedicated bdr.bdr_conflict_history table by setting bdr.log_conflicts_to_table=on in the server config. There the same information is just more nicely structured - but then you might also need some kind of cron job to clean up the table periodically.
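A minimal sketch of that setup, assuming nothing beyond the parameter and table mentioned above is needed:

# postgresql.conf on each node
bdr.log_conflicts_to_table = on

-- the audit records can then be inspected with plain SQL:
SELECT * FROM bdr.bdr_conflict_history;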

Avoid conflicts if possible

Although we've seen that conflict handling is manageable, some friendly advice: the best conflict handling scheme is to have no conflicts at all. So before starting with your application, I'd recommend trying to design a data model where each application works only with its own partition of the data, or with UUIDs. An extract from the BDR documentation:

Conflicts can only happen if there are things happening at the same time on multiple nodes, so the simplest way to avoid conflicts is to only ever write to one node, or to only ever write to independent subsets of the database on each node.

NB! For production use make sure you run the latest 1.0.2 version of BDR, as older ones had a bug where, for INSERT/INSERT handlers, the remotely changed tuple was not populated correctly. Also, in case of more questions or for support on this topic, feel free to contact us.
