What are PostgreSQL's weaknesses? How can PostgreSQL be improved? Usually, in this blog I write about various fun topics around PostgreSQL - like perhaps new cool features, some tricky configuration parameters, performance of particular features or on some โlife hacksโ to ease the life of DBA-s or developers. This post will be quite different though. It's inspired by an article called "10 Things I Hate about PostgreSQL" which I stumbled upon recently. I thought Iโd also try my hand at writing about some less flattering sides of PostgreSQL ๐ And yes, there are some, for sure - as with any major piece of software.
Table of Contents
In the article I referenced, quite a few points are debatable from a technology viewpoint. However, in general I enjoyed the idea of exploring the dark side of the moon once in a while. Especially in our relational database domain, where problems tend to be quite complex - since thereโs a bunch of legacy code out there (PostgreSQL has been around sind 1996!), there are the official SQL standards to be adhered to, and there are a gazillion different database / OS / hardware combinations out there in the wild.
So yeah, you definitely shouldnโt get blinded and praise any product X univocally. Especially given that the software world has gotten so complex and is moving so fast nowadays, that my guess is that nobody actually has the time to try out and learn all those competing technologies to a sufficiently good level. Well maybe only in academia, where one is not expected to directly start generating $$, the faster the better.
About the title - although it was kind of tempting to also use the word โhateโ (more clicks probably), I found it too harsh in the end. In general - PostgreSQL is a truly well written and maintained software project, with a healthy Open Source community around it and it is also my default database recommendation for most applications. Its price - performance ratio is just yet to be beaten in the RDBMS sphere I believe. And there are far more benefits than shortcomings that I can think of.
The most important ones for me are its lightweightedness and simplicity for DBA operations (building a replication server is a one-liner!). From the development side, it is maybe the fact that Postgres is at the very top when it comes to implementing the ISO SQL standard, so that PostgreSQL can't really become a bottleneck for you. You could migrate to something else with a reasonable amount of work. In this area, a lot of commercial competitors have no real interest in such qualities - would you want to make migrating away from your product easy, if itโs generating you tens of thousands of dollars per instance per year?
So in short, as concluded in the post referenced, you canโt go wrong with PostgreSQL for most applications. But anyways, here are some of my points that I think could be improved to make PostgreSQL even better. Most of them are rather minor complaints youโll find, or just looking at the clouds and wishing for more.
This is related to the โno hints dogmaโ from the blog post referenced above (not to forget about the โpg_hint_planโ 3rd party extension though!) one of my biggest gripes with Postgres when it comes to complex queries - is the mostly static query planner. Mostly here means that except for some extra logic on choosing plans for prepared statements and PL/pgSQL stored procedures (which basically also boil down to prepared statements), the planner / execution doesnโt gather any feedback from the actual runs!
It pretty much cold-heartedly looks at the query, the pre-calculated statistics of tables mentioned in there, and then selects a plan and sticks to whatever happens. This holds true even if the row estimate for the first executed node is a million times off and some millions of rows are now going to be passed higher via a nested loop join, an algorithm thatโs rather meant for handling small to medium datasets.
I do understand that employing some adaptive query planning strategies is a huge task. It would require decent resources to tackle it. However, some simple heuristics like automatic re-analyze based on seen data, or trying some other join types in the presented example should be probably doable as โlow hanging fruitsโ. And it seems some people are already trying out such ideas already also - thereโs for example the AQO project. But for real progress some wider cooperation would be probably needed to keep up with the big commercial products, who already throw around buzzwords like โAI-based query optimizationโ. Not sure if just marketing though or they really have something, not much stuff on that on the interwebs.
Slightly connected to the first point is the idea of PostgreSQL automatically selecting some better values for default parameters, based on statistics gathered over time. Alternatively, it could just look at available OS resources. Currently, Postgres actually tries hard on purpose not to know too much about the OS (to support almost all platforms out there). For some reasonable subset of platforms and configuration settings, it could be done - theoretically.
I'm not talking about complex stuff even, but just the main things, like: optimizing the checkpoint settings (after some safety time + warnings), if checkpoints occur too often, increase autovacuum
aggressiveness in case thereโs lots of data changes coming in, look at memory availability and increase work_mem
to get rid of some temp files, or enable wal_compression
if the CPU is sitting idle. Currently, I think the only parameter that is automatically tuned / set is the โwal_buffersโ parameter.
Well, in general, the topic of tuning is not a real problem for seasoned DBA's. Actually, the opposite is true - itโs our bread ๐ However, it would surely benefit most developers out there. Also, this is what the competition is already touting. In the long run, this would benefit the project hugely, since DBA's are kind of a scarce resource nowadays, and extensive tuning could be off-putting for a lot of developers.
Iโm sure most people have not heard or seen this issue, so it canโt really be described as a major problem. It has to do with the fact that old table statistics are not carried over when migrating to a newer Postgres version using the pg_upgrade
utility, which is the fastest way to migrate, by the way. The original post talks about some hours of downtime... but with the โ--linkโ flag, itโs usually less than a minute!
As I said, itโs not a biggie for most users that have ample downtime for upgrades or have only normal simple queries, selecting or updating a couple of rows over an index. On the other hand, for advanced 24/7 shops with lots of data, it can be quite annoying. Itโs hard to predict the kind of spikes youโre going to get during those first critical minutes after an upgrade, when statistics are still being rebuilt. So here, it would be actually really nice if the statistics would not be โjust deletedโ with every release, but only when some incompatibilities really exist.
For some versions, I even looked at the pertinent structures (pg_statistic
group of tables) and couldnโt see any differences. I think itโs just a corner case and hasnโt gotten enough attention. There are also alternatives. If you really want to avoid such โiffyโ moments, you could use logical replication (v10 and above) instead of the good old โin-placeโ upgrade method. Some details on such upgrades can be found in this post about logical replication.
Also โfeaturedโ in the original post - historically speaking, the XID Wraparound and the seemingly randomly operated autovacuum background process have definitely been the number one problem for those who are not so up to date on how Postgres MVCC row versioning works, and havenโt tuned accordingly. At default settings, after some years of operation, given that transaction counts are also increasing steadily, itโs indeed very possible to have some downtime to take care of the gathered โdebtโ.
Still, Iโd say itโs not so tragic for most people as it is depicted, as it affects only very busy databases. And if youโre running with version v9.6 of PostgreSQL or higher, then thereโs a good chance that youโll never get any wraparound-related downtime, because the algorithms are now a lot smarter. It has improved even more as of v12, where autovacuum is much more aggressive by default. You can do quick vacuums by skipping index maintenance!
To defend Postgres a bit - you can enable early warning signals for such situations (log_autovacuum_min_duration
) and there are many internal statistics available...you just need to use this information and take action - the tools are there. But thereโs definitely room for more improvement. For example, one cool idea (from my colleague Ants) would be to allow explicit declaration of โtables of interestโ for long running snapshots. Currently, many autovacuum / wraparound problems are caused by the fact that long-running transactions, sadly, block pretty much all autovacuum cleanup activity within the affected database, even if weโre selecting a table that is completely static. Such declarations could probably also help on the replica side with reducing recovery conflict errors, making Postgres more load-balancing friendly.
- in this area there are also some very promising developments happening with the zHeap project, that aims to provide an alternative to the MVCC row storage model, reducing bloat and thereby surprises from the background processes like the autovacuum.
There have been major improvements in #4, although the fundamental problem is still there:
INSERT
, which can trigger VACUUM earlier and reduce the magnitude of anti-wraparound autovacuum for INSERT
-only tables (see this article about autovacuum and insert-only tables)The second is the more important improvement, but the first is cooler.
This is another well-acknowledged problem-area for PostgreSQL - its โon diskโ footprint is mostly a bit higher than that of its competitors. Normally this doesnโt show in real life, since most of your โworking data setโ should be cached anyway, but it can gradually slow things down. This is especially true if you run at default configs or fail to do any manual maintenance from time to time. But luckily, some things are already happening in this area, like the zHeap project and also a new hybrid store (row + column storage) called ZedStore is emerging that will compress the data size considerably. It will make Postgres more suitable for ultra large data warehouses - so letโs hope it makes it into Postgres v14.
One could throw in here also the full compression topic...but in my opinion itโs not really that important for OLTP databases at least, as enterprise SSD disks are fast as hell nowadays, and you have the option to dabble on the file system level. For those rare use cases where 10TB+ data warehouses with near real-time analytical query expectations are needed, Iโd maybe not recommend Postgres as a default anyway. There are better specialized systems which employ hybrid and columnar storage approaches for such needs; automatic block compression alone would not save us there.
Not to say that automatic compression is a bad idea, though: for example, the latest versions of TimescaleDB (a 3rd party PostgreSQL extension for time-series data) employ such tricks, can reduce the โon diskโ size dramatically, and thereby speed up heavier queries. Thereโs also another extension called cstore_fdw. It's an extension for data warehouse use cases, so there are already options out there for Postgres.
For #5, v13 has somewhat improved things with index de-duplication (at least B-tree indexes with duplicates take less space now).
That is only a slight step in the right direction, however.
I put this one in even though thereโs been some thought put into it, since thereโs a โcontribโ (bundled with Postgres) extension named โauth_delayโ for exactly this purpose. What's true about Postgres out of the box - thereโs no limit on password login attempts! Also, by default, users can open up to $max_connections
(minus some superuser reserved connections) sessions in parallel...this could mean trouble if you have some mean gremlins in your network. Note, however, that under โpassword loginโ I mean โmd5โ and โscram-256โ auth methods, meaning if you forward your authentication to LDAP or AD, you should be safe - but be sure to validate that. Or, you could enable the โauth_delayโ extension; better safe than sorry! Itโs really easy to crack weak passwords if you know the username - basically a one-liner! So it would be really nice to see some kind of soft limits kick in automatically.
N.B.! This doesnโt mean that you donโt get notified of the failed login attempts, but it could be too late if you don't monitor the log files actively! P.S.: thereโs also an extension which deals with the same โproblem spaceโ, called pg_auth_mon and we (CYBERTEC) also have a patch for support customers to disable such attacked accounts automatically after X failed attempts.
Don't forget though - by default, access config Postgres only listens to connections from the localhost, so the potential threat only occurs if some IP range is made explicitly accessible!
For #6,
password_encryption
to scram-sha-256 by default, which makes brute force attacks way harder, since scram hashes are more expensive.require_auth
, which allows the client to reject unsafe authentication methods and makes identity theft by faking the server way harder..This is one of those โwishing for moreโ or โnice to haveโ items on my list. The general idea behind it is that itโs quite wasteful to execute exactly the same deterministic query (fixed parameters, immutable functions etc) and scan some large tables fully every time, returning exactly the same results - if the data hasn't changed! There are probably some non-obvious technical pitfalls lurking around here...but some databases have managed to implement something like that. MySQL now decided to retire it in v8.0 due to some unfortunate implementation details...so the stuff is not easy. However, since performance is the main thing that developers (especially junior ones) worry about, and given that Postgres materialized views are also not always useful, something like that would be an awesome addition.
Phew...that was tiring already. Seems itโs not really that easy to find 10 things to improve about PostgreSQL...because itโs a really solid database engine ๐ So letโs stop at 7 this time, but if you have some additional suggestions, please leave a comment. Hope it broadened your (database) horizons, and do remember to decide on the grand total of pluses and minuses.
Read more about PostgreSQL performance and data types in Laurenz Albe's blog: UNION ALL, Data Types and Performance
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
I'd add "incremental backups" and "incremental updates for materialized views" to your list.
A big 1 on query hints; generally you don't don't need them, but when you do you need them badly. A related issue that I see a lot of complaints on is when the planner changes after the 5th query to a more generic plan that tends towards worse performance.
Still, all in all, I'm very happy with Pg, much more so than Mysql. ๐
How about an IF statement and support for better/easier date functions present in so many other languages?
"IF" should be the same as "CASE WHEN" I believe...or then stored procedures. But PG date functions seem pretty decent to me already, most additions / subtractions can be described in "human language" like "now() '15d 12h'::interval...do you maybe have some specific use case in mind?
Don't underestimate the value of table compression. In SQL Server 2008, I think row and page compression were the best things added to the product. It's not just about disks being fast enough. Most databases today are still I/O bound. That might change one day but it's what I see 90 % of the time right now. Anything you do that reduces I/O by 1/3 (like row compression), or by 4/5 (like page compression) is a big deal, especially when you combine that with the fact that in-memory page caches now hold vastly more rows with the same amount of memory. We work hard to pick the right blend of row and page compression and get really, really good outcomes. I wish it was in PostgreSQL.