Friday, December 06, 2024

Evil MySQL Feature: REPLACE INTO

Yes, my friends, it's time for another EVIL MYSQL FEATURE!

Often used as an “upsert” (insert if not exists, else update), REPLACE INTO is another non-standard, MySQL-specific feature that makes one ask the question, “What were they thinking??”

Fortunately, once again, we have INSERT INTO...ON DUPLICATE KEY UPDATE which, while slightly less snappy to type, is pretty much always the better solution. But the developer has to know to use this statement and to toss REPLACE INTO into the fire. πŸ”₯ And again, standard SQL MERGE (which can also express an upsert) has yet to be implemented in MySQL.

REPLACE INTO's suckage basically all stems from the fact that when a row with the same key already exists, that row is deleted and a new one is inserted.

To dig into why that’s bad a little bit:

1. REPLACE INTO sucks for performance. I can share the full test details if anyone's interested, but briefly: I whipped up a benchmark that ran a whole bunch of REPLACE INTO statements vs. logically identical INSERT INTO...ON DUPLICATE KEY UPDATE statements against the same large table. REPLACE INTO: ~1m 9s vs. INSERT INTO...ON DUPLICATE KEY UPDATE: ~0m 21s. (The discrepancy can be more or less pronounced depending on factors such as the number of indexes on the table; my test case had a few indexes.)
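
To be concrete about what was being compared, the benchmarked statements were logically along these lines (hypothetical table and column names, not the actual benchmark code):

/* deletes the existing row, then inserts a new one */
REPLACE INTO big_table (id, a, b) VALUES (1, 123, 456);

/* updates the existing row in place */
INSERT INTO big_table (id, a, b) VALUES (1, 123, 456) AS new
ON DUPLICATE KEY UPDATE a = new.a, b = new.b;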

2. It may cause unnecessary foreign key violations. Because REPLACE INTO runs a DELETE followed by an INSERT, it can trigger pointless foreign key violations if the table you’re working with is a parent table in a relationship.
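
The parent/child DDL isn't shown in the session transcripts below, but something along these lines would reproduce the setup (an approximation inferred from the error message and query output):

CREATE TABLE parent (
    id INT PRIMARY KEY,
    a  INT,
    b  INT
);

CREATE TABLE child (
    id        INT PRIMARY KEY,
    a         INT,
    b         INT,
    parent_id INT,
    CONSTRAINT fk_child_parent FOREIGN KEY (parent_id) REFERENCES parent (id)
    /* for the ON DELETE CASCADE demo further down, the FK instead ends with
       REFERENCES parent (id) ON DELETE CASCADE */
);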

Demo:

/* REPLACE INTO goes KABOOM */
mysql> REPLACE INTO parent (id, a, b) VALUES (1, 123, 456);
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`core`.`child`, CONSTRAINT `fk_child_parent` FOREIGN KEY (`parent_id`) REFERENCES `parent` (`id`))

/* alternatively, INSERT INTO...ON DUPLICATE KEY UPDATE LGTM */
mysql> INSERT INTO parent (id, a, b)
    -> VALUES (1, 123, 456) AS new
    -> ON DUPLICATE KEY UPDATE
    -> a = new.a,
    -> b = new.b;
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM parent WHERE id = 1; /* and we see the update worked */
+----+------+------+
| id | a    | b    |
+----+------+------+
|  1 |  123 |  456 |
+----+------+------+
1 row in set (0.00 sec)


3. Similar principle: if your foreign keys are defined with ON DELETE CASCADE, you won't get an error, but you will get a surprise: the child rows are deleted! All you wanted to do was update or insert in the parent, and it kills all the child rows! Say we have the same parent and child tables, but this time, child's foreign key was defined with ON DELETE CASCADE:

mysql> SELECT * FROM parent;
+----+------+------+
| id | a    | b    |
+----+------+------+
|  1 |  123 |  456 |
+----+------+------+
1 row in set (0.00 sec)

mysql> SELECT * FROM child; /* here are the children of parent id 1 */
+----+------+------+-----------+
| id | a    | b    | parent_id |
+----+------+------+-----------+
|  1 |    9 |    9 |         1 |
|  2 |    8 |    8 |         1 |
+----+------+------+-----------+
2 rows in set (0.00 sec)

mysql> REPLACE INTO parent (id, a, b) VALUES (1, 123, 456); /* DANGER */
Query OK, 2 rows affected (0.01 sec)

mysql> SELECT * FROM child; /* OH GOD NO!! */
Empty set (0.00 sec)

<snip rolling back the above change>

mysql> INSERT INTO parent (id, a, b) /* whereas this is safe */
    -> VALUES (1, 123, 456) AS new
    -> ON DUPLICATE KEY UPDATE
    -> a = new.a,
    -> b = new.b;
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * from child; /* whew... */
+----+------+------+-----------+
| id | a    | b    | parent_id |
+----+------+------+-----------+
|  1 |    9 |    9 |         1 |
|  2 |    8 |    8 |         1 |
+----+------+------+-----------+
2 rows in set (0.00 sec)

4. Your preexisting row might get assigned a new auto-incremented ID value. If the key that REPLACE INTO is checking against is the ID/primary key, then the value is retained as expected - no problem in this case. But say you have such an ID as your PK, and a unique constraint on another column. If the unique column is instead used to match (i.e. your REPLACE INTO statement doesn't specify ID), the row receives a new ID value.

mysql> CREATE TABLE dummy (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(10) UNIQUE, x INT); /* the key thing here is we have an auto-incremented ID and a unique constraint on name */
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO dummy (name, x) VALUES ('thing1', 1), ('thing2', 2); /* populate table with some data */
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM dummy; /* let's take a look at the newly inserted data, notice the id values */
+----+--------+------+
| id | name   | x    |
+----+--------+------+
|  1 | thing1 |    1 |
|  2 | thing2 |    2 |
+----+--------+------+
2 rows in set (0.00 sec)

mysql> REPLACE INTO dummy (name, x) VALUES ('thing1', 999); /* try to REPLACE INTO by unique name */
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM dummy; /* notice id changed from 1 to 3!! OMG WTF :scream: */
+----+--------+------+
| id | name   | x    |
+----+--------+------+
|  2 | thing2 |    2 |
|  3 | thing1 |  999 |
+----+--------+------+

Maybe the above won't happen too often, and maybe it doesn't always matter, but I personally don't like unexpected data changes, especially primary key changes (what if I have some cached data in my application/ORM? what if people tend to memorize IDs or put them in stored requests - imagine those changing sporadically?). It's just another odd case to worry about - I'd rather remove that worry entirely:

mysql> INSERT INTO dummy (name, x)
    -> VALUES ('thing1', 999) AS new
    -> ON DUPLICATE KEY UPDATE
    ->     name = new.name,
    ->     x = new.x;
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM dummy; /* id 1 is still id 1 (yay) */
+----+--------+------+
| id | name   | x    |
+----+--------+------+
|  1 | thing1 |  999 |
|  2 | thing2 |    2 |
+----+--------+------+

5. I have HEARD, but could not whip up a quick test case to prove, that REPLACE INTO can cause more heavyweight locking/increase the risk of deadlocks. (Source) However, my quick tests on MySQL 8 showed the same amount of locking involved. TBD I suppose - maybe I'll try to dig more deeply into it at some point.
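
If anyone wants to poke at this themselves on MySQL 8, a rough starting point (not a rigorous test) is to hold each statement open in a transaction and inspect performance_schema.data_locks from another session. For example, using the dummy table from item 4:

/* session 1: run the statement inside an open transaction */
START TRANSACTION;
REPLACE INTO dummy (name, x) VALUES ('thing1', 999);

/* session 2: see which locks session 1 is holding */
SELECT object_name, index_name, lock_type, lock_mode, lock_data
FROM performance_schema.data_locks;

/* then ROLLBACK session 1 and repeat with the INSERT...ON DUPLICATE KEY UPDATE version to compare */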

All that said, if you ever really need a DELETE followed by an INSERT, REPLACE INTO may be justified. I feel like I’ve seen the need to do so on rare occasion, but I can’t think of a specific use case right now. But for upsert purposes, do not use REPLACE INTO. It is evil.

tl;dr: MySQL's special REPLACE INTO statement is slower and breaks stuff more frequently than INSERT INTO ... ON DUPLICATE KEY UPDATE. So instead of REPLACE INTO, use INSERT INTO ... ON DUPLICATE KEY UPDATE.

Wednesday, December 04, 2024

Evil MySQL Feature: INSERT IGNORE

πŸ‘ΏπŸ‘ΏπŸ‘Ώ WARNING: EVIL LIES AHEAD. ABANDON ALL HOPE--ETC., ETC. πŸ‘ΏπŸ‘ΏπŸ‘Ώ

The non-standard MySQL IGNORE keyword, when used in conjunction with INSERT, is often thought of by SQL coders as "insert, and if duplicate key, do nothing". But that's not the entire story.

IGNORE is a hack to ignore a slew of errors, not just duplicate key violations (per https://dev.mysql.com/doc/refman/8.4/en/sql-mode.html):

    ER_BAD_NULL_ERROR
    ER_DUP_ENTRY
    ER_DUP_ENTRY_WITH_KEY_NAME
    ER_DUP_KEY
    ER_NO_PARTITION_FOR_GIVEN_VALUE
    ER_NO_PARTITION_FOR_GIVEN_VALUE_SILENT
    ER_NO_REFERENCED_ROW_2
    ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET
    ER_ROW_IS_REFERENCED_2
    ER_SUBQUERY_NO_1_ROW
    ER_VIEW_CHECK_FAILED 

If you don't know what all of those errors mean off the top of your head, well, I don't either. But let's look at a couple of them.

ER_NO_REFERENCED_ROW_2 means:
"Message: Cannot add or update a child row: a foreign key constraint fails%s"

Demo:

mysql> SELECT id FROM parent; -- just showing the parent table's contents
+----+
| id |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)

mysql> SELECT * FROM child; -- just showing the child table is empty
Empty set (0.01 sec)

mysql> INSERT IGNORE INTO child(a, parent_id) VALUES (1, 999); -- try to INSERT IGNORE a row with invalid parent_id
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> SELECT * FROM child; -- no data was actually inserted because of the FK violation
Empty set (0.00 sec)

So essentially INSERT IGNORE will silently ignore a foreign key constraint failure as well. (Technically not silently, as it becomes a warning, but hardly anybody checks for warning responses in their code.)
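
For what it's worth, checking at the console is a one-liner, and the ignored FK error surfaces as a warning roughly like the following (paraphrased, not a verbatim capture):

mysql> SHOW WARNINGS;
/* roughly: | Warning | 1452 | Cannot add or update a child row: a foreign key constraint fails (...) | */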

How about the error ER_SUBQUERY_NO_1_ROW, meaning:
"Message: Subquery returns more than 1 row"

Demo:
mysql> INSERT IGNORE INTO child (a, parent_id)
    -> SELECT
    ->     (
    ->     /* subquery that I expect to only return 1 row, but occasionally that's not true */
    ->     SELECT x FROM some_other_table WHERE x = yat.x
    -> ),
    -> 1
    -> FROM yet_another_table yat;

Query OK, 1000000 rows affected, 65535 warnings (5.15 sec)
Records: 1000000  Duplicates: 0  Warnings: 92622

If I run SHOW WARNINGS; I see a bunch of:

    | Warning | 1242 | Subquery returns more than 1 row |

All of those 92622 rows inserted with warnings now have NULL values for a:

mysql> SELECT COUNT(*) FROM child WHERE a IS NULL;
+----------+
| COUNT(*) |
+----------+
|    92622 |
+----------+
1 row in set (0.13 sec)

Also, a fun (/s) thing: if I repeat the last experiment, except this time with a NOT NULL constraint on child.a, the "bad" values get set to 0 instead of NULL.

/* snip here - I repeated the INSERT IGNORE after adding the NOT NULL constraint on `a`, but let's look at the results */

mysql> SELECT COUNT(*) FROM child WHERE a IS NULL;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM child WHERE a = 0;
+----------+
| COUNT(*) |
+----------+
|    92622 |
+----------+
1 row in set (0.13 sec)



Now you may be wondering, "what if I use strict mode, shouldn't less automagic weirdness happen?" Well, I don't want to get into additional experiments for now, so I'll just let the MySQL documentation speak for itself:

...when the IGNORE keyword and strict SQL mode are both in effect, IGNORE takes precedence. This means that, although IGNORE and strict SQL mode can be considered to have opposite effects on error handling, they do not cancel when used together.

I have to wrap up this post shortly, but notice we haven't even looked at ER_NO_PARTITION_FOR_GIVEN_VALUE, ER_VIEW_CHECK_FAILED, and many other errors. So many weird cases to consider!

πŸ‘ΏπŸ‘ΏπŸ‘Ώ  Just don't use INSERT IGNORE. It is evil. πŸ‘ΏπŸ‘ΏπŸ‘Ώ

But there's hope...The ACTUAL way to do "insert, and if duplicate key, do nothing" in MySQL is something like:

INSERT INTO your_table (column1, column2, ...)
VALUES (value1, value2, ...)
ON DUPLICATE KEY UPDATE id = id; -- there's no `DO NOTHING` no-op in MySQL, so just set a dummy update


It's unfortunate that MySQL doesn't simply support the standard SQL MERGE statement, but MySQL gonna MySQL. Example of standard MERGE:

MERGE INTO customers AS target
USING new_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET target.name = source.name, target.email = source.email
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email)
    VALUES (source.customer_id, source.name, source.email);

I will wrap up this post by saying -- if you are aware of, and have accounted for, ALL of the possible ignored errors, or you just don't care about any of them, then have at it and INSERT IGNORE yourself silly, I suppose! But for my money, if you use INSERT IGNORE with the most common intention - i.e. "insert, and if duplicate key, do nothing" - you're much better off doing the INSERT INTO ... ON DUPLICATE KEY UPDATE ...

Sunday, September 29, 2024

Arrays vs. IN-lists in SQL Using JDBC: Techniques and a Glance Under the Hood

Intro:

Alright, so say you want to run a query that has the WHERE clause WHERE id IN (...) from your Java program, using JDBC, where the content of your IN list is a variable list of integers. There could be one of them; could be 100,000.

Technique 1: Concatenation (πŸ‘Ž)

One way to do it is to concatenate all the integers, separated by commas, directly into the query string. It is probably safe from SQL injection in this case, so long as we are sure they are all integers, but there may be other downsides depending on the DBMS and configuration. For example, many statement caches will treat each variation as a distinct query, and essentially flood the cache with duplicates. (This is how Oracle's library cache will do it by default. YMMV.)

Personally, I’m a purist. It’s against my personal belief system (πŸ˜…) to concatenate parameters ever, unless I have no other option. I’ll omit the rant on that topic for now. But let’s just accept this premise - never concatenate SQL parameters - for the scope of this exercise. 

(OK I lied a little. The short version of the rant is I believe concatenating parameters is like entering a minefield. While you may be able to safely navigate the minefield in some cases, it's best never to go into it when better alternatives exist. Also, consider that others may follow by example, and not understand the nuances of how you safely did it!)

So given that premise, this technique must be thrown away. (Although in the real world, it's the most common one I've seen!)

Technique 2: Parameterize the IN list

A better way is to take your integer list and generate the right number of ? placeholders (or whichever parameter symbol your driver uses). But that also has its issues, including the aforementioned duplication of queries in a statement cache (one for id IN (?), another for id IN (?,?), yet another for id IN (?,?,?), etc.). Another potential issue: some database software will limit the number of parameters you can bind. And this just feels cumbersome IMO: you write something like a loop, iterating n times to concatenate the right number (n) of question marks into the query string, then bind the parameters in another loop, also n times.

A partial example of how to accomplish this:

// requires: import java.util.Arrays; import java.util.stream.Collectors;

// generate the comma-separated list of question marks, one per id (ids is an int[])
String placeholders = Arrays.stream(ids)
                            .mapToObj(id -> "?")
                            .collect(Collectors.joining(", "));

// stick those placeholders into the query
String sql = "SELECT * FROM t WHERE id IN (" + placeholders + ")";

// bind parameters in a loop (pstmt is a PreparedStatement prepared from sql)
for (int i = 0; i < ids.length; i++) {
    pstmt.setInt(i + 1, ids[i]);
}

This is much better than concatenating the values directly IMO. At least we are parameterizing.

Technique 3: Use a Single Array Parameter

What if we were to write our query to work with a single array containing all the integers we want to search for? (Syntax varies per DBMS)

MySQL

If you are working with MySQL, unfortunately MySQL doesn't even have native arrays (SQL:1999 contains the spec for arrays - you would think this would be a reasonable thing for all major DBMSs to have implemented by 2024!!). However, MySQL does at least have JSON support which, of course, includes arrays. So maybe we could create a JSON array containing the integers we want to search for.

Let's give it a shot. Self-imposed requirement: the query must be sargable, meaning an index lookup is possible.

I found it can be done, although it’s not super pretty. Not terrible, but not gorgeous. It looks like this: https://pastebin.com/EV3VxwZ2 

The most important parts of this code example are:

  • Line 11: defining the array of data to set as parameter; notice this is just a string
  • Lines 14-16: the query string
  • Line 19: setting the parameter
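
For anyone who doesn't want to click through, the query in that paste has roughly this shape (the same query appears again, with a literal JSON array, in the execution plans section below):

SELECT * FROM t WHERE id IN (
    SELECT CAST(value AS UNSIGNED)
    FROM JSON_TABLE(
        ?,   /* the JSON array string, e.g. '[1,2,3]', is the single bound parameter */
        '$[*]' COLUMNS (value JSON PATH '$')) AS jt
);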

Again, this is a sargable query, meaning an index on the column id can be used for the search - and according to my testing (see the execution plans section below), it is. A rule of thumb - not a perfect one - is that if you've got a search expression like colname = <anything>, with an index on colname, it's sargable. If you instead put your lookup column inside a function call as a parameter, for example my_function(id, ...), then it's not. (I'm glossing over some exceptions I've seen, and also functional/expression indexes.)

There are other ways you could form this query that may be easier on the eyes, but they are unfortunately NOT sargable and therefore force a full table scan, for example:

SELECT * FROM t WHERE JSON_CONTAINS(?, CAST(id AS JSON)); 

So, I can’t recommend this method unfortunately. Maybe there is another solution to do it with JSON that is more handsome that I haven’t thought of.

It’s worth noting that if you omit the CAST on value in the subquery, it breaks sargability. Why? Because the value returned by JSON_TABLE is of type JSON. Even though MySQL can handle this without error, it fails to use an index in that case. (Similar to how, if you run something like WHERE varchar_id IN (123, 456), it breaks sargability. See this past post for a demonstration.)

PostgreSQL

Why isn't it this easy on every DBMS?!? Here is the Postgres version of the Java code. https://pastebin.com/PYNxSbHU (Lines 11-15 are the important parts in this example.)

Much prettier; zero hackiness. Simply define your Integer[] in Java, then write your query:

SELECT * FROM t WHERE id = ANY(?);

MySQL Execution Plans: IN List vs. Array

In MySQL, the plans are not at all the same between the IN query and the array one, but both are about equally “very fast” with a small set of inputs.

mysql> /* IN query */
EXPLAIN ANALYZE
SELECT * FROM t
WHERE id IN (1,2,3,4,5, 100, -999, 9999, 9941394);

 -> Filter: (t.id in (1,2,3,4,5,100,<cache>(-(999)),9999,9941394))  (cost=8.47 rows=9) (actual time=0.524..0.695 rows=8 loops=1)
    -> Index range scan on t using PRIMARY over (id = -999) OR (id = 1) OR (7 more)  (cost=8.47 rows=9) (actual time=0.521..0.69 rows=8 loops=1)

1 row in set (0.01 sec)

mysql> /* JSON query */
EXPLAIN ANALYZE
SELECT * FROM t WHERE id IN (
    SELECT CAST(value AS UNSIGNED)
    FROM JSON_TABLE(
        '[1,2,3,4,5, 100, -999, 9999, 9941394]',
        '$[*]' COLUMNS (value JSON PATH '$')) AS jt
    );

 -> Nested loop inner join  (cost=1.3 rows=2) (actual time=0.214..0.397 rows=8 loops=1)
    -> Filter: (<subquery2>.CAST(value AS UNSIGNED) is not null)  (cost=0.2..0.4 rows=2) (actual time=0.0855..0.0875 rows=9 loops=1)
        -> Table scan on <subquery2>  (cost=2.5..2.5 rows=0) (actual time=0.0838..0.085 rows=9 loops=1)
            -> Materialize with deduplication  (cost=0..0 rows=0) (actual time=0.0834..0.0834 rows=9 loops=1)
                -> Filter: (cast(jt.value as unsigned) is not null)  (actual time=0.0611..0.0637 rows=9 loops=1)
                    -> Materialize table function  (actual time=0.0538..0.0551 rows=9 loops=1)
    -> Filter: (t.id = <subquery2>.CAST(value AS UNSIGNED))  (cost=0.802 rows=1) (actual time=0.0338..0.0339 rows=0.889 loops=9)
        -> Single-row index lookup on t using PRIMARY (id=<subquery2>.CAST(value AS UNSIGNED))  (cost=0.802 rows=1) (actual time=0.0335..0.0336 rows=0.889 loops=9)
1 row in set (0.01 sec)


I suspect the situation may be different with a large set of inputs.

Postgres Execution Plans: IN list vs. Array

The Postgres execution plans for the two queries follow:

postgres=# EXPLAIN ANALYZE SELECT * FROM child WHERE id = ANY(ARRAY[5, 10, 11, 200]);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using child_pkey on child  (cost=0.43..21.81 rows=4 width=27) (actual time=0.094..0.537 rows=4 loops=1)
   Index Cond: (id = ANY ('{5,10,11,200}'::integer[]))
 Planning Time: 3.148 ms
 Execution Time: 0.568 ms
(4 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM child WHERE id IN (5, 10, 11, 200);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using child_pkey on child  (cost=0.43..21.81 rows=4 width=27) (actual time=0.069..0.086 rows=4 loops=1)
   Index Cond: (id = ANY ('{5,10,11,200}'::integer[]))
 Planning Time: 0.329 ms
 Execution Time: 0.122 ms
(4 rows) 



Postgres doesn't care! There's no funky quirk or logical/performance discrepancy to worry about between the two, as Postgres translates both queries to the exact same execution plan.

This allows us to avoid any worry about how these two queries might differ performance-wise, and focus on the style that is cleanest to us. Awesome job, Postgres dev team!

How about Oracle?

I started investigating this subject in Oracle when an r/sql user said their IN list maxed out at 1000 values (an Oracle limitation). What I found is that if you use the array technique with Oracle over JDBC with a bound parameter, there's no problem at 100,000 values. (I don't know if there's truly "no limit", but whatever limit there might be is > 100,000.)

So I whipped up the same kind of demo, binding an array. Notice the ids array is sized at 100,000 and populated in a loop: https://pastebin.com/91eEbSHx (Important parts: lines 16-18, 22 and 25)

Tiny point of annoyance: the object type for the NUM_ARRAY used in this Oracle example must be created ahead of time: CREATE OR REPLACE TYPE NUM_ARRAY IS TABLE OF NUMBER; (There are other object types you could use in Oracle as well.)
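
For reference, the query side of that kind of demo commonly looks something like this (hedged - not necessarily the exact text in the paste, and the table name is hypothetical):

SELECT *
FROM t
WHERE id IN (SELECT column_value FROM TABLE(CAST(? AS NUM_ARRAY)));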

MySQL Performance Test: search by a large number of ints, IN vs. array

Ok MySQL performance test with a BIG number of IDs to search for and… I was not expecting this!!!

I repeated multiple trials of the JSON array lookup vs. the old-fashioned IN-list search; the average times:

JSON: average time to execute is slightly less than 0.2 sec
IN: average time to execute is about 2.6 sec

Additional info: the table has ~13M rows and is not very wide:


mysql> select count(*) from t;
+----------+
| count(*) |
+----------+
| 13315201 |
+----------+

mysql> describe t;
+-------+------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra          |
+-------+------+------+-----+---------+----------------+
| id    | int  | NO   | PRI | NULL    | auto_increment |
| x     | int  | YES  |     | NULL    |                |
+-------+------+------+-----+---------+----------------+

List of IDs to search for generated by this seq command on MacOS (should run out of the box on any Linux machine as well):

seq --format '%f' -s, 1 100 5000000

Which means: generate integers from 1 through 5,000,000 in increments of 100 (50,000 values).

If you own code that is doing an IN search by a large list of IDs, maybe give a JSON array search a try. No guarantees, but you may get big gains, bro!

The respective execution plans:

IN:
 -> Filter: (t.id in (1,101,201,301,401,501,601,<snip>
,4998701,4998801,4998901,4999001,4999101,4999201,4999301,4999401,4999501,4999601,4999701,4999801,4999901))  (cost=1.35e+6 rows=6.65e+6) (actual time=2.1..2424 rows=50000 loops=1)
    -> Table scan on t  (cost=1.35e+6 rows=13.3e+6) (actual time=2.1..1462 rows=13.3e+6 loops=1)


JSON array:

| -> Nested loop inner join  (cost=1.41 rows=2) (actual time=35.9..184 rows=50000 loops=1)
    -> Filter: (`<subquery2>`.`CAST(value AS UNSIGNED)` is not null)  (cost=0.2..0.4 rows=2) (actual time=33.8..39.3 rows=50000 loops=1)
        -> Table scan on <subquery2>  (cost=2.5..2.5 rows=0) (actual time=33.8..37.2 rows=50000 loops=1)
            -> Materialize with deduplication  (cost=0..0 rows=0) (actual time=33.8..33.8 rows=50000 loops=1)
                -> Filter: (cast(jt.`value` as unsigned) is not null)  (actual time=17.9..25 rows=50000 loops=1)
                    -> Materialize table function  (actual time=17.9..21.7 rows=50000 loops=1)
    -> Filter: (t.id = `<subquery2>`.`CAST(value AS UNSIGNED)`)  (cost=0.906 rows=1) (actual time=0.00271..0.00278 rows=1 loops=50000)
        -> Single-row index lookup on t using PRIMARY (id=`<subquery2>`.`CAST(value AS UNSIGNED)`)  (cost=0.906 rows=1) (actual time=0.00262..0.00264 rows=1 loops=50000)


The huge performance discrepancy may be the result of the IN version deciding to do a full table scan, whereas the JSON version did 50,000 single-row index lookups (one for each ID we're looking up). If you could nudge the IN version to use the index the same way, it might perform about the same (though it may no longer involve writing IN anymore, heh).
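
One cheap thing to try (untested here, and no guarantee the optimizer cooperates with a 50,000-value list) is an index hint on the IN version:

SELECT * FROM t FORCE INDEX (PRIMARY)
WHERE id IN (1, 101, 201 /* , ...the rest of the IDs... */);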

It's also entirely possible a full table scan is the faster approach on some servers. And please don't take away from this post that "JSON array lookup is always faster than IN"!! It was faster in THIS test on THIS machine. It may not always be.

Tl;dr:

Consider using an array instead of IN list if you are binding parameters from JDBC (or similar SQL API). 

Most DBMSs support arrays, but MySQL only supports arrays in the context of JSON, so it's a little bit ugly in MySQL, but not terrible. See: https://pastebin.com/EV3VxwZ2 - make sure to keep the syntax sargable (i.e. such that it can use an index; the linked-to example does use the index and so it's nice and fast)

Edit: there are other techniques that may apply here as well, such as creating a temporary table, inserting your IDs to search for into it, then doing a semi-join or similar. This is a totally valid approach too, and many users may prefer it most of all. I was focused on single-statement solutions when I wrote this post, however I may edit this post in the future to add a demo for this approach, and perhaps compare performance as well. (Then we just need to figure out how to bind all the values in our insert(s). :))
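
A rough sketch of that temp-table variant, with hypothetical names, for anyone trying to picture it:

CREATE TEMPORARY TABLE ids_to_find (id INT PRIMARY KEY);

INSERT INTO ids_to_find (id) VALUES (1), (101), (201);   /* ...plus the rest of the IDs, bound however you like */

SELECT t.*
FROM t
JOIN ids_to_find USING (id);   /* effectively a semi-join, since ids_to_find.id is unique */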

Note: versions tested on were: PostgreSQL 16.3, MySQL 8.3.0, Oracle 19c. The first two were tested on a 2021 MacBook Pro with an M1 Max. Oracle was tested on a db.m5.large on RDS, the smallest instance that Amazon allows (I just wanted something that works)

Monday, March 18, 2024

Postgres trick: Referring to columns dynamically within static SQL

The goal in this demo is to refer to columns dynamically, without the help of some other language dynamically generating SQL. This is so we can avoid repeated, copy-pasted expressions across multiple, similar columns. We want to write the expression only once, ideally.

The solution is a little complicated, and is definitely Postgres-specific, but perhaps a similar technique can be used in any SQL engine that supports similar JSON functionality. (Edit: I pulled off a similar trick in Databricks/Spark SQL recently. Maybe that's worth a separate post.)

So, starting with this table:

select * from t;

 id | url_path1 | url_path2 | url_path3 | url_path4 | url_path5
----+-----------+-----------+-----------+-----------+-----------
  1 | /a/a/a    | /b/b/b    | /c/c/c    | /d/d/d    | /e/e/e
  2 | /f/f/f    | /g/g/g    | /h/h/h    | /i/i/i    | /j/j/j
(2 rows)

The idea is to convert to JSON so that you can apply a transformation (in this demo, I'll simply be concatenating a '/' character) only once for several columns. I want to leave id as-is, but perform the concatenation on each url_path* column.

Then we’ll convert back to relational records in the end.

I’ll be using the demonstrative technique of running one query at a time, putting each into a CTE for the subsequent step and building on that until the end result…

Step 1: 

Convert all columns we want to modify, i.e. all except for id, to JSON, then to records consisting of a key and a value...

We could alternatively do a positive match on keys like 'url_path%', but - 'id' (removing the id key) works just as well for this use case.

postgres=# select id, jsonb_each_text(to_jsonb(t) - 'id') as url_path_kv from t;

 id |    url_path_kv
----+--------------------
  1 | (url_path1,/a/a/a)
  1 | (url_path2,/b/b/b)
  1 | (url_path3,/c/c/c)
  1 | (url_path4,/d/d/d)
  1 | (url_path5,/e/e/e)
  2 | (url_path1,/f/f/f)
  2 | (url_path2,/g/g/g)
  2 | (url_path3,/h/h/h)
  2 | (url_path4,/i/i/i)
  2 | (url_path5,/j/j/j)
(10 rows)

Step 2:

Now let’s separate the keys and values into varchar columns, then concatenate a ‘/’ to the values.

Concatenating ‘/’ is the change to the values of url_path* that we are making. This could be replaced with something complicated that we don’t want to repeat 5 times.

with step1_kv as (
select id, jsonb_each_text(to_jsonb(t) - 'id') as url_path_kv from t
)
select id, (url_path_kv).key as key, (url_path_kv).value || '/' as val_fixed
from step1_kv;

 id |    key    | val_fixed
----+-----------+-----------
  1 | url_path1 | /a/a/a/
  1 | url_path2 | /b/b/b/
  1 | url_path3 | /c/c/c/
  1 | url_path4 | /d/d/d/
  1 | url_path5 | /e/e/e/
  2 | url_path1 | /f/f/f/
  2 | url_path2 | /g/g/g/
  2 | url_path3 | /h/h/h/
  2 | url_path4 | /i/i/i/
  2 | url_path5 | /j/j/j/
(10 rows)

Step 3:

Convert back to JSON using jsonb_object_agg(), which is essentially an aggregate function that builds a JSON object out of key/value pairs.

with step1_kv as (
select id, jsonb_each_text(to_jsonb(t) - 'id') as url_path_kv from t
),
step2_kv_separate as (
select id, (url_path_kv).key as key, (url_path_kv).value || '/' as val_fixed
from step1_kv
)
select id, jsonb_object_agg(key, val_fixed) as url_path_jsonb
from step2_kv_separate
group by id;

id | url_path_jsonb
----+--------------------------------------------------------------------------------------------------------------------------
2 | {"url_path1": "/f/f/f/", "url_path2": "/g/g/g/", "url_path3": "/h/h/h/", "url_path4": "/i/i/i/", "url_path5": "/j/j/j/"}
1 | {"url_path1": "/a/a/a/", "url_path2": "/b/b/b/", "url_path3": "/c/c/c/", "url_path4": "/d/d/d/", "url_path5": "/e/e/e/"}

(2 rows)


Step 4:

Convert back to relational records.


with step1_kv as (
select id, jsonb_each_text(to_jsonb(t) - 'id') as url_path_kv from t
),
step2_kv_separate as (
select id, (url_path_kv).key as key, (url_path_kv).value || '/' as val_fixed
from step1_kv
),
step3_back_to_json as (
select id, jsonb_object_agg(key, val_fixed) as url_path_jsonb
from step2_kv_separate
group by id
)
select id, r.* --back to relational records
from step3_back_to_json
cross join lateral
jsonb_to_record(url_path_jsonb) as r(url_path1 varchar, url_path2 varchar, url_path3 varchar, url_path4 varchar, url_path5 varchar);

id | url_path1 | url_path2 | url_path3 | url_path4 | url_path5
----+-----------+-----------+-----------+-----------+-----------
1 | /a/a/a/ | /b/b/b/ | /c/c/c/ | /d/d/d/ | /e/e/e/
2 | /f/f/f/ | /g/g/g/ | /h/h/h/ | /i/i/i/ | /j/j/j/

(2 rows)


Now, you could easily argue the complexity isn’t worth it...but consider if you had url_path1 through url_path99 and several complex expressions to apply to each. :) Could be worth it!

And I want to say it may be possible to do so without so many conversion steps. I’m kind of a noob with these JSON functions. Plus, I wanted to break the problem down into clear and separate steps for demonstrative purposes.

Anyway, that’s what blew my mind when I learned it was even possible to refer to columns dynamically with such a JSON trick!

Saturday, February 03, 2024

A Bonkers SQL Feature!

It’s time for a “Bonkers SQL Feature” post:

A (IMO) bonkers standard SQL behavior is that newlines between string constants cause implicit concatenation. Spaces between them do not. And if you try a similar thing with column names or variable names (i.e. not constants), there is no implicit concatenation; use two column names in this manner, and you end up with the second column treated as an alias for the first. :)

A quick demo follows:

/* newlines concatenate */
postgres=# select 'a'
'b'
'c'
;
 ?column?
----------
 abc
(1 row)

/* spaces do not concatenate - causes syntax error */
postgres=# select 'a' 'b' 'c';
ERROR:  syntax error at or near "'b'"
LINE 1: select 'a' 'b' 'c';
                   ^

/* let's test newlines but with column names */
postgres=# create table t (a varchar, b varchar);
CREATE TABLE

postgres=# insert into t values('aaa','bbb');
INSERT 0 1
postgres=#
postgres=# select a
b
from t;
  b
-----
 aaa
(1 row)


Per the Postgres documentation, here is the behavior defined by standard SQL (and kudos to them for calling it bizarre):

Two string constants that are only separated by whitespace with at least one newline are concatenated and effectively treated as if the string had been written as one constant.

(This slightly bizarre behavior is specified by SQL; PostgreSQL is following the standard.)

A potential danger in this behavior is if you break up a list using newlines but forget to write a comma, such as:

/* basic IN check */
postgres=# SELECT 'foo' IN ('foo','bar');
 ?column?
----------
 t
(1 row)

/* here 'bar' is on the second line and I forgot to add a comma, so the check effectively becomes 'foo' IN ('foobar'), resulting in false */
postgres=# SELECT 'foo' IN ('foo'
postgres(#  'bar');

 ?column?
----------
 f
(1 row)

MySQL behaves similarly, BUT, horrifyingly to me, they decided to change "Two string constants that are only separated by whitespace with at least one newline" to "Two string constants that are only separated by whitespace." In other words, it takes this oddball behavior one step further and applies it to spaces as well, despite what the standard says. (Although perhaps it makes more intuitive sense not to differentiate between different kinds of whitespace, standard be damned??)

MySQL:

/* newline test is consistent with Standard and Postgres */
mysql> select
    -> 'a'
    -> 'b'
    -> 'c'
    -> ;
+-----+
| a   |
+-----+
| abc |
+-----+
1 row in set (0.00 sec)

/* space test - WAT!! */
mysql> select 'a' 'b' 'c';
+-----+
| a   |
+-----+
| abc |
+-----+
1 row in set (0.00 sec)

/* column name test is the same */
mysql> create table t (col1 varchar(3), col2 varchar(3));
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t (col1, col2) values ('aaa', 'bbb');
Query OK, 1 row affected (0.01 sec)

mysql>
mysql> select col1
    -> col2
    -> from t;
+------+
| col2 |
+------+
| aaa  |
+------+
1 row in set (0.00 sec)

Not sure what else to say except...Be careful out there. :) 

Sunday, November 19, 2023

What have Oracle Materialized Views got that PostgreSQL ones don't?

This comes up once in a while. Just because DBMS x has a Feature ABC, doesn't mean it's the same as Feature ABC on DBMS y.

Postgres materialized views barely do anything. I was so excited when they were announced [version 9.3 in 2013] until I actually looked into them. They are basically something I could implement myself in like 30 minutes with a few pl/pgsql functions (and maybe a table to store the necessary metadata). They just create a table based on a given query, and you can refresh the whole thing on demand (like a truncate/insert). Then you can query the materialized view like any other table. If you want to refresh them automatically, you attach the refresh call to some cron job, or perhaps put it in triggers on all dependencies (creating a mess IMO).
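
For comparison's sake, here is roughly everything you get out of the box in Postgres (a minimal sketch with a hypothetical sales table):

-- build the materialized view once...
CREATE MATERIALIZED VIEW sales_mv AS
SELECT state, sum(qty * price) AS total_sales
FROM sales
GROUP BY state;

-- ...then re-run the entire defining query whenever you want fresher data
REFRESH MATERIALIZED VIEW sales_mv;
-- (REFRESH MATERIALIZED VIEW CONCURRENTLY sales_mv; also works, if the MV has a unique index)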

Oracle materialized views can do some pretty great stuff:

  • Automatic refresh: You don't need to manually call any refresh function (although you CAN if you want to).
    • Refresh on a schedule - Oracle has a built-in scheduler and materialized views are integrated with that, so I can say when I create my MV to "refresh every night at 2am" and it just works. Not a massive selling point, but it's convenient not to need some external scheduler such as cron.
    • Refresh when dependencies (base tables) are updated: Say I create a MV based on a query such as - SELECT x, y, z FROM sales JOIN country ... WHERE country.id IN (...) - now whenever an INSERT/UPDATE/DELETE is run on sales or country, the MV is automatically updated. No call to any refresh function anywhere is necessary. But it's smart about it too - in this case since I'm only concerned about a specific list of countries (in my where clause), if I update a row having a country.id NOT in that in-list, no refresh on the MV will be triggered. So basically it only refreshes when it has to. Your mileage may vary in complex use cases.
  • Query rewrite aka use an MV as an automagic index: This lets you create a MV as sort of an index. The query doesn't need to know about the MV's existence, but if the MV can answer the question the query is asking, the query optimizer automatically reroutes the query to the MV. Example: An application is issuing the query SELECT sum(qty*price) FROM sales but it's gotten pretty slow as it's recomputing that total for millions of sales every time it is run. I can create a MV sales_mv on that exact query, and the query rewrite feature will automatically take the query from sales to sales_mv without any modification of the query or application being necessary. Rewrite can even work in many cases where the MV's query isn't an exact match. For example if I modify the above query to GROUP BY state, the MV now has each sum(qty*price) per state. If I then run a query like SELECT sum(qty*price) FROM sales WHERE state = 'CA', the query rewrite feature is smart enough to know that the MV is a superset of the data I requested, and it knows which subset of the MV can be used to fetch my desired results.
    • I can even provide the MV with dimension metadata that can help the query optimizer in understanding when rewrite can be applied. If I have that same sales query but broken down by month - so something like: SELECT sum(qty*price), month_num FROM sales group by month_num - I can create a dimension that declares that months roll up in to quarters and which ones. So now if I ask the database SELECT sum(qty*price) FROM sales where quarter='Q2' - if I've defined my dimensions properly the query optimizer can figure out that Q2 can be determined by adding up the totals from month_nums 4, 5 and 6 in the MV - a minimal amount of work compared to summing from the detail level up - and it finds me my answer that way.
  • "Fast" refresh: (Oracle calls it "fast" although I prefer the term incremental refresh.) Say that sales MV refresh takes 30 seconds because there's a lot of data. If we have a trickle-in of about a sale per minute, we're spending a lot of time and energy refreshing this whole thing repeatedly - we spend 30 seconds of every minute refreshing. But imagine if you as a human being had the job of keeping track of a sales total on paper. Throughout the day, a salesperson yells at you about a new sale, "We just got another sale of $100!" - you wouldn't recompute your entire sales total from the entire history every single time; rather you'd take your running total - let's say it's $10,000 - and simply add $100 to it - so now it's $10,100. So, that's analogous to how "fast refresh" works - just apply the increments. This feature can be pretty huge in speeding up refreshes.
    • One of many, many use cases: Multi-row constraints: If you've ever wished you could apply some validation across multiple rows in your table, you might be able to accomplish that with materialized views. Since you can add check constraints on a MV, you could define an MV to run some aggregation, and place a constraint on the result of that aggregation. For example, you have a table representing retirement fund allocations. Each customer's allocations must add up to 100%. So you could create a MV on the query SELECT SUM(percent_allocated) AS total_allocated, customer_id FROM allocation_table GROUP BY customer_id -- add a check constraint that total_allocated must equal 100 and now no bad data can ever creep into allocation_table. Now this may be too slow and clunky if the full MV query needed to be refreshed every time the base tables are updated, but again...fast refresh!
  • Oracle also lets you choose to defer applying the "deltas" from a fast refresh to read time of the materialized views. So if I would rather not slow down the writes to the base tables by dealing with MV refresh at that time, I can tell Oracle to automatically queue up those changes and hold off on applying them to the materialized view until the time at which I actually read from it. So it's a tradeoff - if I have a large reporting query backed by MV, the user might not mind if I add, let's say, an average of half a second to it at read time. That may be better than adding a few milliseconds to every write to the underlying tables, if I have tight business constraints on the writes. So this feature allows me to do that, and it's easy and declarative - type a few magic words and it just works.
    • Deferring applying deltas analogy: to continue with the total sales on paper analogy: this is like keeping track of that new $100 sale on a separate paper (let's call it the delta log) that lists each new sale. Maybe you're so busy you don't have the mental capacity to perform addition constantly. Another $100 sale comes in, so you jot "$100" on the delta log. A $200 sale comes in, you write $200. A $1042.59 sale comes in...etc. Now your manager asks, "Hey I need the current total now please." So at THAT point, you catch up on the math: you add 10,000 + 100 + 200 + 1042.59 = 11,342.59, the new running total. Manager thanks you, you toss out the delta paper and start with a new one, repeat.
    • Caching joins: Now think about using fast refresh in conjunction with query rewrite - that's a pretty powerful indexing feature. You could use it to precompute joins. (Everybody complains about how JOINS are slow and painful in SQL - Oracle actually has multiple solutions to pre-join tables. MVs are one of them.) If I have a billion WIDGETs, each of which was made in a FACTORY (linked to by WIDGET.FACTORY_ID), maybe queries that join the two are getting slow. So I can precompute the joins and query rewrite will potentially be able to use the MV for my slow queries, without anyone having to modify a query. And whenever a WIDGET or FACTORY is updated/inserted/deleted, fast refresh applies only the deltas so the whole thing isn't refreshed.
    • So many potential use cases here. Sometimes I see the argument that Postgres has partial indexes, and Oracle doesn't, so isn't Oracle lame? For context, a partial index lets you create an index with a where clause to pre-filter out a bunch of data you don't care about for the scope of a query/queries. So for example CREATE INDEX my_index ON order(id, order_date, etc.) WHERE status = 'ACTIVE'; Now if you have 99% inactive orders at any given time, the index can potentially be used to search the active ones more quickly. Great PG feature btw, but with Oracle you can accomplish essentially the same thing by creating a MV: CREATE MATERIALIZED VIEW ... SELECT id, order_date, etc. WHERE status = 'ACTIVE' - just make sure you enable fast refresh and query rewrite and it's pretty much the same thing. (You can even define the MV as "index organized" to make it physically shaped like a B*Tree index, rather than a heap table, to make it even closer to "the same thing.") Sure, you could argue that in this use case the Oracle solution is more complex, but the point is this one feature can be applied in tons of use cases - I could come up with some all day. :)
  • Stale tolerance: this goes hand in hand with query rewrite. If I am using query rewrite, I can define whether staleness is tolerated. So imagine I am running SELECT sum(qty*price) FROM sales again, and I've defined it to refresh on a nightly schedule (not automatically when dependencies are updated). This means at any given moment during the day, the MV's contents may be "stale" if updates/inserts/deletes have happened since refresh. So I can decide on a per-session basis (if desired) if I'd like query rewrite to route me to the MV when the MV is stale. If I need real-time data, run the normal sum against the guaranteed-to-be-up-to-date sales table instead even though it may be slower. If last night's data is good enough, I declare that staleness is tolerated and I get to use the MV that way.
  • Synchronized Materialized View Groups: Basically you can group and refresh many materialized views together to guarantee that they are in sync. Think atomicity.
  • Plays well with partitions/subpartitions.

There's more, but that's the gist.
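
To make a couple of those bullets concrete, here is a minimal sketch of the DDL (hypothetical sales table; fast refresh and fancier rewrite scenarios also require materialized view logs and a few extra aggregate columns in practice):

CREATE MATERIALIZED VIEW sales_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
    START WITH SYSDATE
    NEXT TRUNC(SYSDATE) + 1 + 2/24   -- "refresh every night at 2am"
  ENABLE QUERY REWRITE
AS
SELECT state, SUM(qty * price) AS total_sales
FROM sales
GROUP BY state;

Swap REFRESH COMPLETE for REFRESH FAST ON COMMIT (plus materialized view logs on the base tables) and you get the incremental, refresh-as-you-write behavior described above.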

All that said - there's a neat extension that at least adds incremental updates to Postgres MVs. It's called pg_ivm: https://github.com/sraoss/pg_ivm

My most upvoted r/sql comment of all time

I try to provide meaningful insight in my comments on r/sql...

But the funny thing is my upvoted comment of all time on that subreddit is simply: