Sunday, September 29, 2024

Arrays vs. IN-lists in SQL Using JDBC: Techniques and a Glance Under the Hood

Intro:

Alright, so say you want to run a query that has the WHERE clause WHERE id IN (...) from your Java program, using JDBC, where the content of your IN list is a variable list of integers. There could be one of them; could be 100,000.

Technique 1: Concatenation (👎)

One way to do it is to concatenate all the integers, separated by commas, directly into the query string. It is probably safe from SQL injection in this case, so long as we are sure they are all integers, but there may be other downsides depending on the DBMS and configuration. For example, many statement caches will treat each variation of the IN list as a distinct query, and essentially flood the cache with near-duplicate entries. (This is how Oracle's library cache behaves by default. YMMV.)
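
For illustration only (this is the approach the rest of this post argues against), here's a rough sketch of what it usually looks like, assuming an int[] named ids and a java.sql.Connection named conn:

// splice the ids straight into the SQL text -- see the caveats above
String inList = Arrays.stream(ids)
        .mapToObj(String::valueOf)
        .collect(Collectors.joining(", "));
String sql = "SELECT * FROM t WHERE id IN (" + inList + ")";   // e.g. IN (1, 2, 3)
ResultSet rs = conn.createStatement().executeQuery(sql);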

Personally, I’m a purist. It’s against my personal belief system (😅) to concatenate parameters ever, unless I have no other option. I’ll omit the rant on that topic for now. But let’s just accept this premise - never concatenate SQL parameters - for the scope of this exercise. 

(OK I lied a little. The short version of the rant is I believe concatenating parameters is like entering a minefield. While you may be able to safely navigate the minefield in some cases, it's best never to go into it when better alternatives exist. Also, consider that others may follow by example, and not understand the nuances of how you safely did it!)

So given that premise, this technique must be thrown away. (Although in the real world, it's the most common one I've seen!)

Technique 2: Parameterize the IN list

A better way is to take your integer list and generate the right number of ?s (or whichever placeholder symbol you need) in the query string. But this also has its issues, including the aforementioned duplication of queries in a statement cache (one entry for id IN (?), another for id IN (?,?), yet another for id IN (?,?,?), etc.). Another potential issue: some database software limits the number of parameters you can bind. And it just feels cumbersome IMO: you write something like a loop, iterating n times to concatenate the right number (n) of question marks into the query string, then bind the parameters in another loop, also n times.

A partial example of how to accomplish this:

// generate the comma-separated list of question marks, one per id (ids is an int[])
String placeholders = Arrays.stream(ids)
        .mapToObj(id -> "?")
        .collect(Collectors.joining(", "));

// stick those placeholders into the query
String sql = "SELECT * FROM t WHERE id IN (" + placeholders + ")";

// prepare the statement, then bind the parameters in a loop
PreparedStatement pstmt = conn.prepareStatement(sql);
for (int i = 0; i < ids.length; i++) {
    pstmt.setInt(i + 1, ids[i]);
}

This is much better than concatenating the values directly IMO. At least we are parameterizing.

Technique 3: Use a Single Array Parameter

What if we were to write our query to work with a single array containing all the integers we want to search for? (Syntax varies per DBMS)

MySQL

If you are working with MySQL, unfortunately MySQL doesn't even have native arrays (SQL:1999 contains the spec for arrays; you would think this would be a reasonable thing for all major DBMSs to have implemented by 2024!!). However, MySQL does at least have a JSON type which, of course, can hold arrays. So maybe we could create a JSON array containing the integers we want to search for.

Let's give it a shot. Self-imposed requirement: the query must be sargable, meaning an index lookup is possible.

I found it can be done, although it’s not super pretty. Not terrible, but not gorgeous. It looks like this: https://pastebin.com/EV3VxwZ2 

The most important parts of this code example are:

  • Line 11: defining the array of data to set as parameter; notice this is just a string
  • Lines 14-16: the query string
  • Line 19: setting the parameter
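
If you'd rather not click through, here's roughly the shape of it. This is my own rough sketch (so it won't line up with the pastebin's line numbers), assuming a table t with an indexed INT column id, an int[] named ids, and a java.sql.Connection named conn:

// MySQL has no array type, so ship the ids as a JSON array *string*, e.g. "[1, 2, 3]"
String jsonIds = Arrays.toString(ids);   // int[] -> "[1, 2, 3]", which happens to be valid JSON

// JSON_TABLE unpacks the JSON array into rows; the CAST keeps the lookup on id sargable
String sql =
    "SELECT * FROM t WHERE id IN ( " +
    "  SELECT CAST(value AS UNSIGNED) " +
    "  FROM JSON_TABLE(?, '$[*]' COLUMNS (value JSON PATH '$')) AS jt)";

PreparedStatement pstmt = conn.prepareStatement(sql);
pstmt.setString(1, jsonIds);   // the whole array binds as a single string parameter
ResultSet rs = pstmt.executeQuery();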

Again, this is a sargable query, meaning an index on the column id can be used for the search (and is, according to my testing; see the execution plans section below). A rule of thumb - not a perfect one - is that if you've got a search expression like colname = <anything>, with an index on colname, it's sargable. If you instead put your lookup column inside a function call as a parameter, for example my_function(id, ...), then it's not. (I'm glossing over some exceptions I've seen, and also functional/expression indexes.)

There are other ways you could form this query that may be easier on the eyes, but they are unfortunately NOT sargable and therefore force a full table scan, for example:

SELECT * FROM t WHERE JSON_CONTAINS(?, CAST(id AS JSON)); 

So, I can’t recommend this method unfortunately. Maybe there is another solution to do it with JSON that is more handsome that I haven’t thought of.

It’s worth noting that if you omit the CAST on value in the subquery, it breaks sargability. Why? Because the value returned by JSON_TABLE is of type JSON. Even though MySQL can handle this without error, it fails to use an index in that case. (Similar to how, if you run something like WHERE varchar_id IN (123, 456), it breaks sargability. See this past post for a demonstration.)

PostgreSQL

Why isn't it this easy on every DBMS?!? Here is the Postgres version of the Java code. https://pastebin.com/PYNxSbHU (Lines 11-15 are the important parts in this example.)

Much prettier; zero hackiness. Simply define your Integer[] in Java, then write your query:

SELECT * FROM t WHERE id = ANY(?);
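
In JDBC that's just createArrayOf plus setArray. A minimal sketch (the table name t and the surrounding variable names are my assumptions; the pastebin has the full version):

Integer[] ids = {1, 2, 3, 100, 9999};

// build a java.sql.Array from the Integer[] and bind it as a single parameter
java.sql.Array idArray = conn.createArrayOf("integer", ids);

PreparedStatement pstmt = conn.prepareStatement("SELECT * FROM t WHERE id = ANY(?)");
pstmt.setArray(1, idArray);
ResultSet rs = pstmt.executeQuery();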

MySQL Execution Plans: IN List vs. Array

In MySQL, the plans are not at all the same between the IN query and the array one, but both are about equally “very fast” with a small set of inputs.

mysql> /* IN query */
EXPLAIN ANALYZE
SELECT * FROM t
WHERE id IN (1,2,3,4,5, 100, -999, 9999, 9941394);

 -> Filter: (t.id in (1,2,3,4,5,100,<cache>(-(999)),9999,9941394))  (cost=8.47 rows=9) (actual time=0.524..0.695 rows=8 loops=1)
    -> Index range scan on t using PRIMARY over (id = -999) OR (id = 1) OR (7 more)  (cost=8.47 rows=9) (actual time=0.521..0.69 rows=8 loops=1)

1 row in set (0.01 sec)

mysql> /* JSON query */
EXPLAIN ANALYZE
SELECT * FROM t WHERE id IN (
    SELECT CAST(value AS UNSIGNED)
    FROM JSON_TABLE(
        '[1,2,3,4,5, 100, -999, 9999, 9941394]',
        '$[*]' COLUMNS (value JSON PATH '$')) AS jt
    );

 -> Nested loop inner join  (cost=1.3 rows=2) (actual time=0.214..0.397 rows=8 loops=1)
    -> Filter: (<subquery2>.CAST(value AS UNSIGNED) is not null)  (cost=0.2..0.4 rows=2) (actual time=0.0855..0.0875 rows=9 loops=1)
        -> Table scan on <subquery2>  (cost=2.5..2.5 rows=0) (actual time=0.0838..0.085 rows=9 loops=1)
            -> Materialize with deduplication  (cost=0..0 rows=0) (actual time=0.0834..0.0834 rows=9 loops=1)
                -> Filter: (cast(jt.value as unsigned) is not null)  (actual time=0.0611..0.0637 rows=9 loops=1)
                    -> Materialize table function  (actual time=0.0538..0.0551 rows=9 loops=1)
    -> Filter: (t.id = <subquery2>.CAST(value AS UNSIGNED))  (cost=0.802 rows=1) (actual time=0.0338..0.0339 rows=0.889 loops=9)
        -> Single-row index lookup on t using PRIMARY (id=<subquery2>.CAST(value AS UNSIGNED))  (cost=0.802 rows=1) (actual time=0.0335..0.0336 rows=0.889 loops=9)
 |
1 row in set (0.01 sec)


I suspect the situation may be different with a large set of inputs.

Postgres Execution Plans: IN list vs. Array

The Postgres execution plans for the two queries follow:

postgres=# EXPLAIN ANALYZE SELECT * FROM child WHERE id = ANY(ARRAY[5, 10, 11, 200]);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using child_pkey on child  (cost=0.43..21.81 rows=4 width=27) (actual time=0.094..0.537 rows=4 loops=1)
   Index Cond: (id = ANY ('{5,10,11,200}'::integer[]))
 Planning Time: 3.148 ms
 Execution Time: 0.568 ms
(4 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM child WHERE id IN (5, 10, 11, 200);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using child_pkey on child  (cost=0.43..21.81 rows=4 width=27) (actual time=0.069..0.086 rows=4 loops=1)
   Index Cond: (id = ANY ('{5,10,11,200}'::integer[]))
 Planning Time: 0.329 ms
 Execution Time: 0.122 ms
(4 rows) 



Postgres doesn't care! There's no funky quirk or logical/performance discrepancy to worry about between the two, as Postgres translates both queries to the exact same execution plan.

This allows us to avoid any worry about how these two queries might differ performance-wise, and focus on the style that is cleanest to us. Awesome job, Postgres dev team!

How about Oracle?

I started investigating this subject in Oracle when an r/sql user said their IN list maxed out at 1000 values (an Oracle limitation). What I found is that if you use the array technique with Oracle over JDBC with a bound parameter, there's no problem at 100,000 values. (I don't know if there's truly "no limit", but whatever limit there might be is > 100,000.)

So I whipped up the same kind of demo, binding an array. Notice the ids array is sized at 100,000 and populated in a loop: https://pastebin.com/91eEbSHx (Important parts: lines 16-18, 22 and 25)

Tiny point of annoyance: the NUM_ARRAY collection type used in this Oracle example must be created ahead of time: CREATE OR REPLACE TYPE NUM_ARRAY IS TABLE OF NUMBER; (There are other object types you could use in Oracle as well.)
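
In case the pastebin isn't handy, here's one way the binding can look. This is a rough sketch (not necessarily line-for-line what's in the pastebin), assuming a table t with a NUMBER primary key id, the NUM_ARRAY type above, and a java.sql.Connection named conn:

// populate the 100,000 ids to search for
int[] ids = new int[100_000];
for (int i = 0; i < ids.length; i++) {
    ids[i] = i + 1;
}

// wrap the ids in a java.sql.Array of the NUM_ARRAY nested table type
// (OracleConnection is oracle.jdbc.OracleConnection, from the Oracle JDBC driver)
OracleConnection oraConn = conn.unwrap(OracleConnection.class);
java.sql.Array idArray = oraConn.createOracleArray("NUM_ARRAY", ids);

// TABLE() turns the bound collection into rows we can run the IN (semi-join) against
PreparedStatement pstmt = conn.prepareStatement(
    "SELECT * FROM t WHERE id IN (SELECT column_value FROM TABLE(?))");
pstmt.setArray(1, idArray);
ResultSet rs = pstmt.executeQuery();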

MySQL Performance Test: search by a large number of ints, IN vs. array

Ok MySQL performance test with a BIG number of IDs to search for and… I was not expecting this!!!

I ran multiple trials of the JSON array lookup vs. the old-fashioned IN search. The average times:

JSON: average time to execute is slightly less than 0.2 sec
IN: average time to execute is about 2.6 sec

Additional info: the table has ~13M rows and is not very wide:


mysql> select count(*) from t;
+----------+
| count(*) |
+----------+
| 13315201 |
+----------+

mysql> describe t;
+-------+------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra          |
+-------+------+------+-----+---------+----------------+
| id    | int  | NO   | PRI | NULL    | auto_increment |
| x     | int  | YES  |     | NULL    |                |
+-------+------+------+-----+---------+----------------+

The list of IDs to search for was generated by this seq command on macOS (it should run out of the box on any Linux machine as well):

seq --format '%f' -s, 1 100 5000000

Which means: generate integers from 1 through 5,000,000 in increments of 100 (50,000 IDs in total).

If you own code that is doing an IN search by a large list of IDs, maybe give a JSON array search a try. No guarantees, but you may get big gains, bro!

The respective execution plans:

IN:
 -> Filter: (t.id in (1,101,201,301,401,501,601,<snip>
,4998701,4998801,4998901,4999001,4999101,4999201,4999301,4999401,4999501,4999601,4999701,4999801,4999901))  (cost=1.35e+6 rows=6.65e+6) (actual time=2.1..2424 rows=50000 loops=1)
    -> Table scan on t  (cost=1.35e+6 rows=13.3e+6) (actual time=2.1..1462 rows=13.3e+6 loops=1)


JSON array:

| -> Nested loop inner join  (cost=1.41 rows=2) (actual time=35.9..184 rows=50000 loops=1)
    -> Filter: (`<subquery2>`.`CAST(value AS UNSIGNED)` is not null)  (cost=0.2..0.4 rows=2) (actual time=33.8..39.3 rows=50000 loops=1)
        -> Table scan on <subquery2>  (cost=2.5..2.5 rows=0) (actual time=33.8..37.2 rows=50000 loops=1)
            -> Materialize with deduplication  (cost=0..0 rows=0) (actual time=33.8..33.8 rows=50000 loops=1)
                -> Filter: (cast(jt.`value` as unsigned) is not null)  (actual time=17.9..25 rows=50000 loops=1)
                    -> Materialize table function  (actual time=17.9..21.7 rows=50000 loops=1)
    -> Filter: (t.id = `<subquery2>`.`CAST(value AS UNSIGNED)`)  (cost=0.906 rows=1) (actual time=0.00271..0.00278 rows=1 loops=50000)
        -> Single-row index lookup on t using PRIMARY (id=`<subquery2>`.`CAST(value AS UNSIGNED)`)  (cost=0.906 rows=1) (actual time=0.00262..0.00264 rows=1 loops=50000)


The huge performance discrepancy may be the result of the IN version deciding to do a full table scan, whereas the JSON version did 50,000 single-row index lookups (one for each ID we're looking up). If you could nudge the IN version to do the same, it might perform about the same (though then it may not really involve writing IN anymore, heh).

It's also entirely possible a full table scan is the faster approach on some servers. And please don't take away from this post that "the JSON array lookup is always faster than IN"!! It was faster in THIS test on THIS machine. It may not always be.

Tl;dr:

Consider using an array instead of an IN list if you are binding parameters from JDBC (or a similar SQL API).

Most DBMSs support arrays, but MySQL only supports arrays in the context of JSON, so it's a little bit ugly in MySQL, though not terrible. See: https://pastebin.com/EV3VxwZ2 - make sure to keep the syntax sargable (i.e. such that it can use an index; the linked example does use the index and so it's nice and fast).

Edit: there are other techniques that may apply here as well, such as creating a temporary table, inserting your IDs to search for into it, then doing a semi-join or similar. This is a totally valid approach too, and many users may prefer it most of all. I was focused on single-statement solutions when I wrote this post, however I may edit this post in the future to add a demo for this approach, and perhaps compare performance as well. (Then we just need to figure out how to bind all the values in our insert(s). :))

Note: the versions tested were PostgreSQL 16.3, MySQL 8.3.0, and Oracle 19c. The first two were tested on a 2021 MacBook Pro with an M1 Max. Oracle was tested on a db.m5.large on RDS, the smallest instance Amazon allows (I just wanted something that worked).