Monday, August 24, 2020

What does Postgres do better than MySQL?

Here's a bit of an essay I wrote on what does Postgres do that MySQL doesn't? It's not comprehensive - just some of the features and issues I've encountered that I tried to pour out into that post. I tried to give MySQL a few props for what they've done better but I had to really , and most of it is just speculation (like, I've HEARD that managing its replication is easier...) Some of my criticisms are dated, so I try to edit and updated as needed.

Anyway, here it is:

I could probably rant for hours on this subject. My company uses MySQL, and while I love the place, I miss Postgres (and other DBs). I feel like I'm about 15-20 years backwards w/ MySQL. Some major points:

  • MySQL lags behind by FAR in keeping up with the times. MySQL 8.0 (released in 2018) addresses some major points, for example the standard features Window Functions (from SQL:2003) and CTEs (from SQL:99) which are incredibly useful. Just read that again. It took them until 2018 to release very useful (not useless junk), standard features that are from 2003 and 1999, respectively. That said, you might ask, well if they've caught up now then what's the problem? The problem is that it's a repeated trend I've seen since the early 2000s (MySQL devs were reluctant to implement features we take for granted now like transactions.) Furthermore, discussions like this lead me to believe I shouldn't have faith in the correctness of newer MySQL features. Furthermore they actually haven't caught up. There are some great SQL:92(!) (3 decades old!) features such as deferrable constraints, available in most major DBMSs that MySQL still lacks.
  • Generally, when MySQL does catch up with a feature that Postgres has had for years, it has major limitations. Examples: recursive CTEs do not have breadth-first or depth-first options as Postgres does, and does not have built-in cycle detection to avoid infinite loops (the MySQL blog does show you how to hack together your own solution, at least). Another example is you can't use user-defined stored functions in a MySQL functional index (called expression indexes in Postgres-ese). Once in a while, MySQL does win in this area - i.e. Postgres has the more limited feature - but that is far less frequent in my experience.
  • MySQL has a less flexible and less smart query planner.
  • Postgres has better explain/query plan analyzing features. For example, Postgres lets me view a "before the query is executed" plan estimate, as well as an "after execution" plan. The latter gives me detailed info of how the query actually ran, how long each specific step in the query took, indexes used, and how much memory each step consumed. E.g., I see three joins and a where clause: among those, the WHERE filter was quick and so were two of the joins, but one join was slow. This allows me to focus on the specific pain point. MySQL's EXPLAIN tool only gives me a "before" (much less valuable IMO) plan estimate, and the info it provides can be cryptic and shallow. So when I'm tuning MySQL queries, there's more experimentation and guesswork. (Edit: MySQL does have EXPLAIN ANALYZE now, as of 8.0.18 released on Oct. 2019; Postgres apparently introduced it in PG 7.2, released way back in 2002)
  • Transactional DDL is a huge advantage of PG, though this is more of a plus of PG than a negative of MySQL, as many other DBMSs lack this as well. I can create a table, insert some data, create an index, alter the table, roll it all back atomically if I want, or back to a savepoint in the middle. Flexible and it makes schema/data change management much cleaner (in the context of tools like Liquibase, or plain sql scripts). In MySQL, each DDL statement (such as that alter table) forces a commit.
  • MySQL has arbitrary index key limits - 767 bytes in that row format, 3k in this one, etc., that can be a pain sometimes. (Edit: I learned PG does have this too; but when using a row format in MySQL that had the low limit of 767 bytes, I found myself hitting that too often.)
  • PG has more indexing options: expression indexes, partial indexes, BRIN, GIST, etc. Without getting into specifics, each is useful wrt different kinds of data sets and kinds of queries. (Edit: I since learned that MySQL 8 does have something similar to expression indexes, i.e. functional indexes, but they have more limitations.)
  • Postgres lets you use heap tables, MySQL does not (assuming InnoDB storage engine). <TODO: add info>
  • Postgres has nice array handling. So you can store array types such as an array of ints or an array of varchars in your table. There are also various array functions and operators to read the arrays, manipulate them, and so on.
  • Postgres has materialized views, albeit limited ones. These basically let you cache query results and refresh them on demand. Great for reports, summary tables and such. Not as advanced as the MVs Oracle offers, but better than the nothing that MySQL offers.
  • MySQL has wacky settings like STRICT_MODE (and others) that deal with default behaviors, many designed to sweep errors under the rug. Auto-conversions of numeric types to strings and vice versa - doesn't sound so bad but can result in bad data and really weird stuff happening, like this that I once wrote about: my old blog While they can often be disabled, in my experience, few devs ever do so. Other weird behaviors include missing numbers and dates being generated as 0 or 0000-00-00 (I forget the exact circumstances), and others. Special meanings assigned to 0.. Headache inducing. :)
  • Nearly every advanced feature in MySQL - you dig into it and it has a bunch of weird, arbitrary limitations and quirks. Postgres - things generally work cleanly; limitations are rare but typically make sense. <TODO:examples>
  • UTF-8. Part of my job is to review my coworkers' DDL changes. Sometimes they create new VARCHAR columns with the character set "utf8". Unfortunately, "utf8" in MySQL is not standard UTF-8 - it is their now-deprecated, broken first attempt at UTF-8. The newer encoding, "utf8mb4" is the right one. So I have to advise devs repeatedly "No, utf8 is not actually UTF-8, utf8mb4 is. Use that instead." There is an overarching issue here, too. Generally, PG has well thought-out designs, from features down to the naming conventions. I follow the pg-hackers newsgroup, in which the devs discuss all things related to PG development. They would bang on each others' ideas, constructively criticize, ensure solutions are consistent with the PG Way. There's probably some of that in MySQL, but I can't imagine the PG devs making a broken UTF-8 implementation.
  • TEXT types in MySQL come with a slew of baggage, See here. But your average dev leans towards TEXT when they can't think of a limit for their VARCHAR. (MySQL forces you to specify a max length, such as VARCHAR(50).) In PG, you can define a VARCHAR without a length, or TEXT if you want, with zero baggage. As basic as this point is, I find it to be hugely valuable!
  • MySQL allows non-standard SQL styles that are "not good" IMO, like selecting columns not in the GROUP BY and quoting string literals with either single or double quotes. (The GROUP BY variant can be valid, but often has unexpected side effects that most devs don't realize.) Non-standard backticks around identifiers...
  • MySQL is only starting to implement parallel querying - it supports exactly two, almost useless niche cases e.g. SELECT COUNT(*) FROM my_table(no WHERE clause allowed), while PG's have been solid in many kinds of queries for years. (Parallel querying meaning multiple threads each execute part of your query simultaneously.)
  • Overall PG is cleaner, with better, well-thought-out design, and fewer wonky "gotchas". (how about that MySQL 5.5 and earlier timestamp madness? Oh the headaches that's caused!)
  • Hash joins: an efficient algorithm in some situations used to execute your joins, was only just released in MySQL 8, but to quote the MySQL Blog: "MySQL only supports inner hash join, meaning that anti, semi and outer joins are still executed using block-nested loop." In PG, it works in more situations. Again, a common theme is more limitations with advanced features in MySQL.
  • Postgres supports many procedural languages such as pl/pgsql, pl/java, pl/python. So you can actually write code that strongly resembles Java or Python in your PG stored procedures. Also you can use such procedural code in (see next bullet point)...
  • Anonymous blocks: A stored procedure has to be created and it lives on the database. What if you just want a one-off proc? You could create the proc, execute it, then drop it. Worry about naming collisions, permissions to create and drop a proc, etc. But with an anonymous block: send a "script" over to your database and it runs, without creating any db object. This can be handy.

Those are some major points.

What does MySQL do better? You can tell I'm biased, but I'll do my best to be fair:

  • It was more popular for a while - it may still have an edge in that area though I think the margin has narrowed considerably. Popularity has the advantage of your skillset in the more popular tool being more valuable in the job market, and cloud services and the like are likely to support more popular ones first. E.g., back when I started using Amazon RDS years ago, it supported MySQL but not PG, so I reluctantly chose MySQL for a project. Now it supports both.
  • I've heard, though don't have firsthand experience with this - that MySQL's replication is easier to manage.
  • When you update a row in PG, it is marked as a "dead tuple", a new row is written, and dead tuples are cleaned up asynchronously by a process called the autovacuum daemon. It's like a garbage collector. This is not necessarily bad, and is often good. But if you have an extremely high rate of updates (such as thousands of update transactions - not to be confused with rows updated - on a single table per second), your table might perpetually bloat as the daemon cannot keep up. This can in turn lead to an issue: all 32-bit values of transaction IDs are assigned, crashing PG.
    • Worth noting: this is talked about to a disproportionately high degree as a red flag, because it sounds so bad, but in practice this is extremely rare. You would have to use up all 32-bit transaction IDs - FOUR BILLION - AND it must be the case that the daemon can't keep up. Keep in mind that Postgres provides you with configurable parameters for autovacuum daemon's behavior, so even in such rare circumstances it may be manageable.
  • Postgres connections are more heavyweight, as each uses a server process. Some think this means PG is "not scalable". They may be right to some degree, but this shouldn't turn you off to PG. (There are different dimensions to scalability.) You may need to tighten connection management if you need beyond, say, thousands of simultaneous connections. This is being reviewed by PG devs last I heard (2023).

Let's keep it going:

  • Want to add a column to your 200-million row table? Let's say it has columns a, b and c and you want to add d.
    • Here's what Postgres does under the hood:
      • Add column d. The end.
    • Here's what MySQL does under the hood (EDIT: this one is less true, i.e. not true in all circumstances, as of MySQL 8.0, since the INSTANT DDL algorithm was added):
      • Create new table with columns a, b, c and d.
      • Copy all the 200 million rows over to new table.
      • Recreate indexes and whatever else depends on the table. Not sure what else really, but I wonder how views interact with this.
      • Drop the old table.
      • Rename the new table

I'm probably oversimplifying the Postgres case a little bit, as it too may have to update views and such, as well as allocate disk, but the key point is it certainly doesn't have to move potentially vast quantities of data around for no reason. Now it MAY be wise to rebuild tables/indexes occasionally anyway, but being forced to do that with every change - no thanks.

  • Want to work with data dumps?
    • Postgres's tool pg_dump lets you dump data to a variety of formats: a plain-text SQL script, a custom binary format, a directory or a tar. The "custom" format is great because it's nice and flexible. Say I dump an entire schema of tables a,b,c and d, and all dependent objects (indexes, views, etc.). Later, if I want to restore just tables b and d, I tell pg_restore something like --tables=b,d. Voila, it works.
    • MySQL's mysqldump only lets me write the plain-text SQL. It is a script, full of commands like CREATE TABLE t, INSERT INTO t, etc. At my job, we generate these every night and they are tens of gigabytes. Of plain-text SQL commands. Good luck if you only want to restore tables b and d. You have to edit the text file to comment out a, and c, all the dependent objects and all the insert statements. Maybe use tools like awk or sed, I guess, if the file is so huge that you can't open it in your favorite text editor. Really fun times if your db is in the GB, TB or bigger range. I actually thought about writing a tool to automatically handle this, but lost the motivation.
  • Postgres has the hierarchy: cluster (an instance of PG essentially) => database => schema => table (and other objects). In MySQL databases and schemas are synonymous. Why? I don't know. Maybe not the biggest deal technically, but I find it creates confusion in a cultural sense. For example I often hear colleagues say things like, "Well THAT's a separate database, we shouldn't have a foreign key from table x to table y if they are in separate databases! That's dangerous because they might reside on different servers one day." They also might write database creation scripts operating on the idea that one schema == one siloed "database" even when that's not ideal. Nomenclature matters. But maybe I designed table x and y to reside in different SCHEMAS in the SAME DATABASE, it's just that MySQL makes no distinction. So in Postgres I might just put x and y in different schemas under the same database, it is clear that they are roommates who live together, and that's much less confusing. The PG way also opens the door to logical grouping of db objects other than tables. For example if I have a lot of stored procs, once you get beyond 20 or 30 of them you use your GUI client of choice to view them all, and what you see is this massive, mind-boggling, flat list of them all in one place. With the concept of separate schemas in the same database, I could make one schema to encapsulate all of, say, billing procedures, another for database administrative tasks, another for order management. It's a great way to divide up procedures (and other objects) logically without conflating with the concept of databases (which "everyone knows" are distinct entities).
  • JSON types - MySQL's JSON stuff is getting pretty good, but Postgres's JSON/JSONB types and related functions are more robust, with better indexing features.

Editing much later to add another point:

MySQL has mandatory indexes on foreign key columns: OK I will first admit this could be good or bad depending on your point of view. I find it bad. To briefly summarize the context, it is sometimes, but not always, a good idea to build indexes on your foreign key columns. If you never look up child rows given a parent row, and never use cascading effects like ON DELETE CASCADE, then you don’t need to index a foreign key. But MySQL forces you to create these indexes even when they’re not necessary. Of course, indexes take time to build, take up disk space, slow down updates/inserts etc. On larger tables you could have indexes that are hundreds of MB or even GB, and if you don’t need them, that’s really silly to force us to build them.

Furthermore it’s just annoying for them to be mandatory, when it blocks other operations for little reason. Just recently I was expanding a MySQL table’s unique key from columns a,b to a,b,c. So my first idea was to drop the unique key then recreate it on columns a,b,c. Simple enough. BUT, b was a foreign key column so I got an error on the drop unique key command. So instead I had to:

  1. drop the foreign key
  2. drop the unique constraint
  3. create the new unique constraint
  4. recreate the foreign key

I can, however, see the “good” point of view being that careless developers need indexing on all foreign keys to be enforced, or else they’ll never bother to make them when they ARE necessary, lest they cause performance/locking issues. I just don’t agree with this POV, though. I would prefer to just educate the developers working on schema design to create indexes judiciously rather than just blanket-enforce all FKs must be indexed everywhere.


SQL here != SQL there.

Can we please stop thinking some SQL command you used on database server x is universal? As you may be aware, I love answering questions on reddit, r/sql, to help people out with their SQL questions. But unfortunately I'm constantly finding myself having to ask, "Which database are you using?" as well as correcting others who suggest solutions that only work on a different database than what the person having the problem is using. This is because people think SQL works the same everywhere, I guess. 

There are extensions to SQL on all the major database systems. None of them implements every feature in all of the SQL specs. Even if they did, the SQL specs have many optional features. So for one feature, it might say, "This feature works like THIS, but OPTIONALLY it works like THAT." So even if we had two SQL databases that fully implement SQL (we don't), they could vary in behavior.

This morning I was frustrated by getting downvoted when correcting a commenter who gave a MS SQL Server-specific solution to a poster who didn't even say he's on MS SQL Server! 

So I made this meme to vent my frustrations.


SQL here != SQL there.