Sunday, November 19, 2023

My r/PostgreSQL comment on what do Oracle Materialized Views got that PostgreSQL ain't be...gotten?

Postgres materialized views barely do anything. I was so excited when they were announced [version 9.3, in 2013] until I actually looked into them. They are basically something I could implement myself in like 30 minutes with a few PL/pgSQL functions (and maybe a table to store the necessary metadata). They just create a table based on a given query, and you can refresh the whole thing on demand (like a truncate/insert). Then you can query the materialized view like any other table. If you want them refreshed automatically, you attach the refresh call to some cron job, or perhaps put it in triggers on all the dependencies (creating a mess, IMO).
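
For reference, here's a minimal sketch of essentially the entire Postgres feature (table and column names are made up):

    -- Everything Postgres gives you: a snapshot table built from a query...
    CREATE MATERIALIZED VIEW sales_totals AS
      SELECT region, SUM(qty * price) AS total
      FROM sales
      GROUP BY region;

    -- ...which you can query like any other table...
    SELECT total FROM sales_totals WHERE region = 'West';

    -- ...and rebuild wholesale on demand. CONCURRENTLY (which requires a
    -- unique index on the MV) avoids locking out readers during the rebuild.
    CREATE UNIQUE INDEX ON sales_totals (region);
    REFRESH MATERIALIZED VIEW CONCURRENTLY sales_totals;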

Oracle materialized views can do some pretty great stuff:

  • Automatic refresh: You don't need to manually call any refresh function (although you CAN if you want to).
    • Refresh on a schedule: Oracle has a built-in scheduler, and materialized views are integrated with it, so when I create my MV I can say "refresh every night at 2am" and it just works (see the refresh sketch after this list). Not a massive selling point, but it's convenient not to need some external scheduler such as cron.
    • Refresh when dependencies (base tables) are updated: Say I create an MV based on a query such as SELECT x, y, z FROM sales JOIN country ... WHERE country.id IN (...) - now whenever a transaction that INSERTs/UPDATEs/DELETEs rows in sales or country commits, the MV is automatically updated (also in the refresh sketch after this list). No call to any refresh function anywhere is necessary. But it's smart about it, too - in this case, since I'm only concerned with a specific list of countries (in my WHERE clause), updating a row whose country.id is NOT in that in-list triggers no refresh of the MV. So basically it only refreshes when it has to. Your mileage may vary in complex use cases.
  • Query rewrite, aka use an MV as an automagic index: This lets you create an MV as a sort of index. The query doesn't need to know about the MV's existence, but if the MV can answer the question the query is asking, the query optimizer automatically reroutes the query to the MV (see the query-rewrite sketch after this list). Example: An application is issuing the query SELECT sum(qty*price) FROM sales, but it's gotten pretty slow, as it's recomputing that total over millions of sales every time it runs. I can create an MV sales_mv on that exact query, and the query rewrite feature will automatically reroute the query from sales to sales_mv without any modification of the query or application being necessary. Rewrite can even work in many cases where the MV's query isn't an exact match. For example, if I modify the above query to GROUP BY state, the MV now has each sum(qty*price) per state. If I then run a query like SELECT sum(qty*price) FROM sales WHERE state = 'CA', the query rewrite feature is smart enough to know that the MV is a superset of the data I requested, and it knows which subset of the MV can be used to fetch my desired results.
    • I can even provide the MV with dimension metadata that helps the query optimizer understand when rewrite can be applied. If I have that same sales query but broken down by month - something like SELECT sum(qty*price), month_num FROM sales GROUP BY month_num - I can create a dimension that declares that months roll up into quarters, and which ones. So now if I ask the database SELECT sum(qty*price) FROM sales WHERE quarter='Q2' - if I've defined my dimensions properly - the query optimizer can figure out that Q2 can be computed by adding up the totals from month_nums 4, 5 and 6 in the MV (a minimal amount of work compared to summing up from the detail level), and it finds me my answer that way. (A tiny dimension example is included in the query-rewrite sketch after this list.)
  • "Fast" refresh: (Oracle calls it "fast" although I prefer the term incremental refresh.) Say that sales MV refresh takes 30 seconds because there's a lot of data. If we have a trickle-in of about a sale per minute, we're spending a lot of time and energy refreshing this whole thing repeatedly - we spend 30 seconds of every minute refreshing. But imagine if you as a human being had the job of keeping track of a sales total on paper. Throughout the day, a salesperson yells at you about a new sale, "We just got another sale of $100!" - you wouldn't recompute your entire sales total from the entire history every single time; rather you'd take your running total - let's say it's $10,000 - and simply add $100 to it - so now it's $10,100. So, that's analogous to how "fast refresh" works - just apply the increments. This feature can be pretty huge in speeding up refreshes.
    • One of many, many use cases - multi-row constraints: If you've ever wished you could apply some validation across multiple rows in your table, you might be able to accomplish that with materialized views. Since you can add check constraints on an MV, you could define an MV to run some aggregation and place a constraint on the result of that aggregation. For example, you have a table representing retirement fund allocations, and each customer's allocations must add up to 100%. So you create an MV on the query SELECT SUM(percent_allocated) AS total_allocated, customer_id FROM allocation_table GROUP BY customer_id, then add a check constraint that total_allocated must equal 100, and now no bad data can ever creep into allocation_table (sketched after this list). This might be too slow and clunky if the full MV query needed to be refreshed every time the base tables are updated, but again... fast refresh!
  • Oracle also lets you choose to defer applying the "deltas" from a fast refresh until read time of the materialized view. So if I would rather not slow down writes to the base tables by dealing with MV refresh at that time, I can tell Oracle to automatically queue up those changes and hold off on applying them to the materialized view until the moment I actually read from it. It's a tradeoff: if I have a large reporting query backed by an MV, the user might not mind if I add, let's say, an average of half a second at read time. That may be better than adding a few milliseconds to every write to the underlying tables, if I have tight business constraints on the writes. This feature lets me make that choice, and it's easy and declarative - type a few magic words and it just works (the last statement in the refresh sketch after this list).
    • To continue the sales-total-on-paper analogy for deferred deltas: this is like keeping track of that new $100 sale on a separate paper (let's call it the delta log) that lists each new sale. Maybe you're so busy you don't have the mental capacity to perform addition constantly. Another $100 sale comes in, so you jot "$100" on the delta log. A $200 sale comes in; you write "$200". A $1,042.59 sale comes in... etc. Now your manager asks, "Hey, I need the current total now, please." So at THAT point, you catch up on the math: 10,000 + 100 + 200 + 1,042.59 = 11,342.59, the new running total. The manager thanks you, you toss out the delta paper, start a new one, and repeat.
    • Caching joins: Now think about using fast refresh in conjunction with query rewrite - that's a pretty powerful indexing feature. You could use it to precompute joins. (Everybody complains about how JOINs are slow and painful in SQL - Oracle actually has multiple solutions for pre-joining tables. MVs are one of them.) If I have a billion WIDGETs, each of which was made in a FACTORY (linked to by WIDGET.FACTORY_ID), maybe queries that join the two are getting slow. So I can precompute the join, and query rewrite will potentially be able to use the MV for my slow queries, without anyone having to modify a query. And whenever a WIDGET or FACTORY row is inserted/updated/deleted, fast refresh applies only the deltas, so the whole thing isn't refreshed.
    • So many potential use cases here. Sometimes I see the argument that Postgres has partial indexes, and Oracle doesn't, so isn't Oracle lame? For context, a partial index lets you create an index with a WHERE clause to pre-filter out a bunch of data you don't care about for the scope of a query or queries. So, for example: CREATE INDEX my_index ON orders(id, order_date, etc.) WHERE status = 'ACTIVE'; Now if you have 99% inactive orders at any given time, the index can potentially be used to search the active ones more quickly. Great PG feature, btw, but with Oracle you can accomplish essentially the same thing by creating an MV: CREATE MATERIALIZED VIEW ... AS SELECT id, order_date, etc. FROM orders WHERE status = 'ACTIVE' - just make sure you enable fast refresh and query rewrite, and it's pretty much the same thing (sketched after this list). (You can even define the MV as "index organized" to make it physically shaped like a B*Tree index, rather than a heap table, to make it even closer to "the same thing.") Sure, you could argue the Oracle solution is more complex in this use case, but the point is that this one feature can be applied in tons of use cases - I could come up with more all day. :)
  • Stale tolerance: This goes hand in hand with query rewrite. If I am using query rewrite, I can define whether staleness is tolerated. So imagine I am running SELECT sum(qty*price) FROM sales again, and I've defined the MV to refresh on a nightly schedule (not automatically when dependencies are updated). This means that at any given moment during the day, the MV's contents may be "stale" if updates/inserts/deletes have happened since the last refresh. So I can decide, on a per-session basis if desired, whether I'd like query rewrite to route me to the MV while it's stale. If I need real-time data, I run the normal sum against the guaranteed-up-to-date sales table instead, even though it may be slower. If last night's data is good enough, I declare that staleness is tolerated and get to use the MV that way (see the query-rewrite sketch after this list).
  • Synchronized Materialized View Groups: Basically you can group and refresh many materialized views together to guarantee that they are in sync. Think atomicity.
  • Plays well with partitions/subpartitions.
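
To make the refresh bullets above concrete, here's a rough sketch of the Oracle DDL involved. The table and column names are invented, and I'm glossing over fast-refresh prerequisites (the exact MV log contents and required COUNT columns vary by MV shape - check the docs before copying any of this):

    -- Fast (incremental) refresh needs a materialized view log on each base table:
    CREATE MATERIALIZED VIEW LOG ON sales
      WITH ROWID, SEQUENCE (qty, price) INCLUDING NEW VALUES;

    -- Refresh on a schedule: full rebuild every night at 2am, no cron required.
    CREATE MATERIALIZED VIEW sales_nightly_mv
      REFRESH COMPLETE START WITH SYSDATE NEXT TRUNC(SYSDATE) + 1 + 2/24
    AS SELECT SUM(qty * price) AS total FROM sales;

    -- Refresh when base tables change: deltas are applied as each writing
    -- transaction commits. (The COUNT columns make the SUM fast-refreshable.)
    CREATE MATERIALIZED VIEW sales_on_commit_mv
      REFRESH FAST ON COMMIT
    AS SELECT SUM(qty * price) AS total,
              COUNT(qty * price) AS cnt_sale, COUNT(*) AS cnt
       FROM sales;

    -- Or defer applying the deltas until read time ("real-time" MVs, 12.2+):
    CREATE MATERIALIZED VIEW sales_on_query_mv
      REFRESH FAST ON DEMAND
      ENABLE ON QUERY COMPUTATION
    AS SELECT SUM(qty * price) AS total,
              COUNT(qty * price) AS cnt_sale, COUNT(*) AS cnt
       FROM sales;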
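
Query rewrite and the staleness policy, with the same caveats (plus a tiny dimension example, assuming a calendar table that carries both month_num and quarter):

    -- A rewrite-enabled MV; queries against sales may be silently rerouted here.
    CREATE MATERIALIZED VIEW sales_by_state_mv
      REFRESH COMPLETE START WITH SYSDATE NEXT TRUNC(SYSDATE) + 1 + 2/24
      ENABLE QUERY REWRITE
    AS SELECT state, SUM(qty * price) AS total
       FROM sales
       GROUP BY state;

    -- The application's query is untouched; the optimizer may rewrite it to the MV:
    SELECT SUM(qty * price) FROM sales WHERE state = 'CA';

    -- Staleness policy, per session: ENFORCED (the default) only rewrites to
    -- provably fresh MVs; STALE_TOLERATED says last night's data is good enough.
    ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;

    -- Dimension metadata widens what rewrite can do, e.g. declaring that
    -- months roll up into quarters:
    CREATE DIMENSION calendar_dim
      LEVEL month   IS (calendar.month_num)
      LEVEL quarter IS (calendar.quarter)
      HIERARCHY cal_rollup (month CHILD OF quarter);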
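
The multi-row constraint trick, sketched with hypothetical names. The key is that an MV is backed by a real table, which is what lets us hang a check constraint on it:

    CREATE MATERIALIZED VIEW LOG ON allocation_table
      WITH ROWID, SEQUENCE (customer_id, percent_allocated) INCLUDING NEW VALUES;

    -- Recompute each customer's total allocation as part of every commit.
    CREATE MATERIALIZED VIEW allocation_check_mv
      REFRESH FAST ON COMMIT
    AS SELECT customer_id,
              SUM(percent_allocated) AS total_allocated,
              COUNT(percent_allocated) AS cnt_alloc, COUNT(*) AS cnt
       FROM allocation_table
       GROUP BY customer_id;

    -- Any commit that leaves a customer's total at anything but 100 now fails.
    ALTER TABLE allocation_check_mv
      ADD CONSTRAINT allocations_must_total_100 CHECK (total_allocated = 100);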
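
And the partial-index stand-in (again just a sketch; for fast refresh, a single-table MV like this generally wants the base table's rowid in its select list):

    CREATE MATERIALIZED VIEW LOG ON orders WITH ROWID;

    -- Roughly Postgres's CREATE INDEX ... WHERE status = 'ACTIVE': only the
    -- interesting slice is materialized, it's maintained incrementally, and
    -- query rewrite can route eligible queries to it transparently.
    CREATE MATERIALIZED VIEW active_orders_mv
      REFRESH FAST ON COMMIT
      ENABLE QUERY REWRITE
    AS SELECT id, order_date, status, ROWID AS rid
       FROM orders
       WHERE status = 'ACTIVE';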

There's more, but that's the gist.

 

Source (myself): https://old.reddit.com/r/PostgreSQL/comments/kexm13/people_who_used_other_databases_before_does/gg5m6cr/

My most upvoted r/sql comment of all time

I try to provide meaningful insight in my comments on r/sql...

But the funny thing is that my most upvoted comment of all time on that subreddit is simply:

My answer to the question on r/sql: "What is the most common SQL mistake you seen get made by folks?"

Concatenating user inputs instead of binding parameters. Then when a bug ticket comes in for customer O'Reilly (made-up example), instead of replacing the technique of concatenation with binding parameters (aka parameterization), writing a bug fix that escapes quotes. 🤦

I give talks at my company about SQL and how to use it effectively - I gave one on just this subject recently. The short version of an analogy I like (the long version involves a live demo) is this: in Java, if you wanted to write a basic command-line program that takes an argument, you would use Java's String[] args parameter to the main method. You would NOT bypass this correct way of doing it by instead using a special macro in the source code, whose value you replace with a tool such as sed before compilation. It's intuitively silly, and it unnecessarily forces you into a minefield that you must navigate - checking for dangerous special characters such as semicolons and comment characters that could lead to bugs, or even worse, the dreaded "Java injection". (Better sanitize that input! /s) In the live demo, I show that with this ridiculous technique, one could delete a file by passing some file-deletion code as an argument. Of course, you shouldn't have to deal with any of this - just pass your arbitrary argument to the program by the perfectly safe means provided to us by Java, String[] args, and be done with it. It doesn't matter if the arg contains semicolons, comments, file.delete(); or anything else you can think of. You are setting the value of a variable, as opposed to dynamically mixing the value into the source code. It is safe.

Somehow it's less widely known, but it's the same situation in SQL. Think of your SQL query as source code - it gets parsed and a binary query object is produced, just like a Java source file being compiled to a class file. (Details vary per DBMS.) To plug in parameters, you should use the means provided to us: namely, parameterization via JDBC, ODBC or whatever. Do NOT modify the SQL "source code" to plug in parameters, just as you wouldn't do so in the Java analogy above. When you parameterize correctly, you do not have to worry about escaping quotes or handling semicolons, nor do you have to check for special keywords like DROP TABLE or otherwise "sanitize your inputs" (as the famous Bobby Tables xkcd comic unfortunately suggests) in any way whatsoever. It's the exact same thing as the Java case. "SQL injection" should be as ridiculous a phrase as "Java injection" was above. (I'm aware of dynamic SQL, but I'm omitting nuance for the sake of brevity, and dynamically constructing queries from arbitrary user inputs is far less common than passing in parameter values.) Parameterization is setting the value of a variable, as opposed to dynamically mixing the value into the source code. It is safe.
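
To sketch the idea at the SQL level (names made up): in application code you'd get here through your driver's bind API - e.g., a JDBC PreparedStatement with ? placeholders - rather than typing PREPARE yourself, but PostgreSQL's server-side prepared statements show the same separation of query text and values:

    -- The query "source code" is parsed once, with a placeholder where the value goes.
    PREPARE find_customer (text) AS
      SELECT id, last_name FROM customers WHERE last_name = $1;

    -- The value is bound to $1 as pure data; it is never spliced into the query
    -- text, so quotes, semicolons, or a DROP TABLE inside it can't change the
    -- statement's structure. (The doubled quote below is just SQL literal syntax
    -- for typing the value interactively; a driver binds the raw string directly.)
    EXECUTE find_customer('O''Reilly');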

That's in the domain of writing application SQL. As someone who reviews MySQL and Snowflake schema changes for about 200 developers, in that department the most common general mistake is a lax attitude toward testing. When I ask a developer to provide evidence of what the state of the relevant table(s) looked like before and after their proposed change, I often get pushback: "What? I'm just modifying a varchar column's max length, why do I need to test that?"

So here's a demo of the single most common error I see regarding THAT specific thing in MySQL: accidentally wiping out column attributes, because those attributes must be re-specified! https://dbfiddle.uk/vmpGShfp

But this isn't the same in every SQL database. (Another general attitude that is totally mistaken is that SQL is SQL and is the same everywhere. I could rant about THIS subject for hours.) But let me just say for now, the above MySQL "gotcha" does not exist in, for example, Postgres. Here is a similar demo using Postgres. Note that in Postgres, extending the varchar length DOES NOT destroy the NOT NULL constraint, default value, etc.: https://dbfiddle.uk/OB8hRoJv

 

Source, myself on: https://old.reddit.com/r/SQL/comments/17wo0tr/what_is_the_most_common_sql_mistake_you_seen_get/k9i99x0/

MySQL Booby Trap Alert! The big MySQL gotcha when modifying a column's data type

Here’s a demo of that really important MySQL gotcha about side effects when modifying a column, and a case against “SQL is pretty much the same everywhere” :)

Say I want to expand a VARCHAR(30) to a VARCHAR(40), on a column that has a NOT NULL constraint, a default value, and perhaps other attributes.

Demo: MySQL edition: https://dbfiddle.uk/vmpGShfp
Demo: PostgreSQL edition: https://dbfiddle.uk/OB8hRoJv

As you can see from the above demos, in MySQL it's not enough to simply run a command like alter table t modify a varchar(40); - you also need to respecify the NOT NULL constraint, the default value if any, and other attributes (including any non-default character set and collation!)
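
In code form (the same idea as the fiddles above, with a made-up table):

    -- MySQL
    CREATE TABLE t (
      a VARCHAR(30) NOT NULL DEFAULT 'N/A'
    );

    -- Looks harmless, but MODIFY redefines the entire column: the NOT NULL
    -- constraint and the default (and any non-default character set,
    -- collation, or column comment) are silently dropped.
    ALTER TABLE t MODIFY a VARCHAR(40);

    -- What you actually have to run - respecify every attribute:
    ALTER TABLE t MODIFY a VARCHAR(40) NOT NULL DEFAULT 'N/A';

    -- PostgreSQL, by contrast, changes just the type and leaves everything else intact:
    -- ALTER TABLE t ALTER COLUMN a TYPE varchar(40);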

Now imagine I’m using a generic SQL script, or a tool like Liquibase that’s supposed to be “database agnostic”… Say you initially have Postgres, and this change doesn’t mangle anything. Then you switch to MySQL and re-run the same scripts: mangle city!

Yeah, you’ll get different results, and Liquibase certainly doesn’t handle this difference for you automatically. (I’ve checked!)

The question is: which other databases also have this behavior?