Thursday, June 13, 2019

Clusters, Clusters Everywhere!


Clusters: what an overloaded term in the database world! Whenever someone talks to me about a database cluster, or a clustered index or clustered table, I have to stop and think for a moment to figure out what they mean. I am going to attempt to list all of the contexts in which the term “cluster” or “clustered” apply that I can think of, in no particular (or logical) order:

PostgreSQL Cluster: In Postgres, a cluster is simply your database storage area on disk. It is a collection of databases. The hierarchy is that a cluster has many databases that have many schemas that have many relations and other objects. A Postgres Cluster is more or less synonymous with a specific installation of Postgres.

Catalog Cluster: The SQL standard definition. Essentially the same as the PostgreSQL Cluster but generic. (I hadn’t actually heard of this one before except in the Postgres documentation in which they define a Postgres Cluster.)

Clustered Index and Non-clustered Index (MS SQL Server): In SQL Server, a clustered index refers to the table itself being physically shaped like an index. The index stores the entire row, sorted by key, in this index structure. A non-clustered index in Microsoft Land contains the key values in the index, but the rest of the row resides elsewhere in a heap or clustered table (see below)

MySQL (InnoDB) has something much like this too, but you don’t have a choice to use heap tables as an alternative. In Oracle they are called Index-Organized Tables and heap tables are the default.

Advantages: looking up by primary key is faster.

Disadvantages: inserts are slower.

There are more nuances, both pros and cons, but that's the gist of it. (One thing to consider is if your Primary Key is a UUID vs. a standard auto-increment integer, but I'll let you Google that. :))
 
Clustered Table (MS SQL Server): A table that has a clustered index (above) is called a clustered table.

Clustered Index (Postgres): An index that indicates the sort order of the table on which it is built. You first create your table as usual. Then create a clustered index on the columns you want to sort by. Populate the table. Then run the CLUSTER command to sort the table. Note that unlike the Microsoft clustered index above, a Postgres clustered index does NOT store the entire row's data in the index structure.

Advantages: this may speed up queries that use table scans with the sorted columns in the where clause. Check out the top answer on this stackexchange question - it talks about looking up by a date range on such a sorted table and the performance benefits - https://dba.stackexchange.com/questions/39589/optimizing-queries-on-a-range-of-timestamps-two-columns

Some more info from the PG documentation: "In cases where you are accessing single rows randomly within a table, the actual order of the data in the table is unimportant. However, if you tend to access some data more than others, and there is an index that groups them together, you will benefit from using CLUSTER. If you are requesting a range of indexed values from a table, or a single indexed value that has multiple rows that match, CLUSTER will help because once the index identifies the table page for the first row that matches, all other rows that match are probably already on the same table page, and so you save disk accesses and speed up the query."

Disadvantages: You need to maintain your sort order by periodically re-running the CLUSTER command, and a table lock is held on the table during the process.

I once had a summary table that was completely rebuilt every night during a maintenance period, and the CLUSTER command was run afterward. So in a case like this having to run the CLUSTER command is not much of an issue.

The CLUSTER command (also Postgres): This pairs up with the Clustered Index for Postgres above. It is simply the command you issue to perform the sort - i.e. to re-order the table according to the Clustered Index.

Table Clusters (Oracle): A group of tables that share common columns and store related data in the same blocks on disk. Essentially this is a way to physically colocate tables that are joined frequently, while maintaining the logical distinction between the tables. Table clusters can be used to reduce disk I/O and improve access times for joins. They can reduce storage requirements too, as the cluster key value is not stored repeatedly for each row.

You can also put just a single table in an Oracle Table Cluster - so all the data would be grouped by key in its physical home on disk. This may allow some queries to perform faster. See this old AskTom post.

Cluster Index (Oracle): A Cluster Index is used to support a Table Cluster (above). "To locate a row in a cluster, the cluster index is used to find the cluster key value, which points to the data block associated with that cluster key value. Therefore, Oracle accesses a given row with a minimum of two I/Os--possibly more, depending on the number of levels that must be traversed in the index." -Oracle docs

Hash Clusters (Oracle): (Oh dear lord, it keeps going with Oracle, doesn't it?) I'm just going to quote Tom Kyte on this one:

Hash clusters are useful when you always access the data by primary key. For example, say you have some data in a table T and you ALWAYS query:

select * from T where id = :x;

That might be a good candidate for a hash cluster since there will be no index needed. Oracle will hash the value of :x into a physical address and go right to the data. No index range scan, just a table access by the hash value

The “common sense” definition of a Cluster: this is what I think most people are talking about when they mention a “mysql cluster” or a “database cluster." They are talking about a group of several servers, each running a database service, probably with some sort of replication set up, working in tandem to achieve load balancing and high availability. This is more a general concept than a specific feature/technology.

Real Application Cluster (aka RAC): A trademark of the Oracle Corporation, is an Oracle-specific implementation of the “common sense” definition of a cluster. This feature lets you use many Oracle instances (an instance is a set of processes and memory areas) together in a “shared everything” architecture. Each server has its own copy of the database (database referring to the data files, essentially) and they are kept in sync automatically.

Confused yet?? :) Actually for me, writing out these definitions helps me become a little less confused. Hope it helps you, too! A question for my vast readership out there -- is there any other "cluster" I missed? I am sure there are many, many implementations of the "common sense" cluster, but that's ok, we don't need to list every one. :) Also I realize some of these features are supported by databases I didn't mention such as Sybase. Again, no need to list every one.

Ok, until next time, keep clustering! (Or something...) 




No comments: