Snowflake for SQL Server Users - Part 17 - Data clustering

In SQL Server, most tables benefit from having a clustering key, i.e., the column or columns that the table is logically sorted by. (Note: much older training material and documentation said “physical order”, but that’s never been true). To achieve this, SQL Server stores the data within each page in sorted order, and maintains a doubly-linked logical chain of pages. In addition, it maintains pointers to the data via the index it creates on the clustering key.
By comparison, Snowflake’s design tends to encourage you to avoid creating clustering keys on tables. For most tables, the data will automatically be well-distributed across the micro-partitions.
You can assess how well that’s working by checking whether there are multiple micro-partitions with overlapping values. There’s an image illustrating that in the Snowflake documentation.
The more overlap there is, the greater the clustering depth. You can see the depth by calling the SYSTEM$CLUSTERING_DEPTH system function. It’s also possible to see all the most important clustering details (including the depth) by calling the SYSTEM$CLUSTERING_INFORMATION system function.
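As a quick sketch of those two calls (the table name SALES and column ORDER_DATE are hypothetical — substitute your own):

```sql
-- Average clustering depth of the table for the specified column(s)
SELECT SYSTEM$CLUSTERING_DEPTH('SALES', '(ORDER_DATE)');

-- Fuller detail as JSON: depth, overlap counts, and a partition-depth histogram
SELECT SYSTEM$CLUSTERING_INFORMATION('SALES', '(ORDER_DATE)');
```

If the table already has a clustering key defined, the column list argument can be omitted and the defined key is used.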
Sometimes you need to cluster
This is all good, but sometimes you need to cluster a table, particularly if a lot of INSERT, UPDATE, or DELETE operations have been executed against it. (Note: this is only an issue if those operations changed the columns involved in the auto-clustering, and have introduced skew and additional clustering depth).
The aim of clustering (which, as noted, isn’t for all tables) is to co-locate related table data in the same micro-partitions, and to minimize clustering depth.
Generally, you’ll only do this for very large tables. Sure signs that you need to do it are that your queries have slowed down markedly and the clustering depth has increased.
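For illustration, defining a clustering key on such a table looks like this (the SALES table and its columns are hypothetical names, not from any particular system):

```sql
-- Define (or change) the clustering key on an existing large table
ALTER TABLE SALES CLUSTER BY (REGION, ORDER_DATE);

-- Or define it at creation time
CREATE TABLE SALES
(
    SALE_ID    NUMBER,
    REGION     VARCHAR,
    ORDER_DATE DATE,
    AMOUNT     NUMBER(18,2)
)
CLUSTER BY (REGION, ORDER_DATE);

-- Remove the clustering key again if it isn't paying off
ALTER TABLE SALES DROP CLUSTERING KEY;
```

Columns listed earlier in the key should generally be the lower-cardinality ones that your queries filter on most often.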
Micro-partition pruning is the elimination of micro-partitions that aren’t needed for a particular query. It’s somewhat like partition elimination and clustered columnstore segment elimination in SQL Server, and it’s really important for query performance.
But there’s a cost
While it might seem obvious to keep all tables clustered, like we often do in SQL Server, the more frequently the data in a table changes, the more work is involved in maintaining the clustering.
However, if you have tables that don’t change much and are queried all the time, clustering could produce a great outcome.
A few other notes
It’s also worth noting that, unlike applying a clustering key in SQL Server (via a primary key or a clustered index), the change isn’t applied immediately. It’s done in the background.
You also want to avoid high-cardinality columns as clustering keys. For example, if you have TIMESTAMP columns (similar to SQL Server datetime or datetime2 columns), you’d be better off clustering on an expression (like a computed column) that truncates the value to a date.
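That technique can be sketched like this (EVENTS and EVENT_TS are hypothetical names for a table with a TIMESTAMP column):

```sql
-- Avoid clustering directly on the high-cardinality TIMESTAMP:
--   ALTER TABLE EVENTS CLUSTER BY (EVENT_TS);

-- Instead, cluster on an expression that reduces it to a date
ALTER TABLE EVENTS CLUSTER BY (TO_DATE(EVENT_TS));
```

Queries that filter on EVENT_TS can still benefit from pruning, because all rows for a given day land in the same set of micro-partitions.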
Reclustering a table in Snowflake is automatic, and is performed when the table becomes sufficiently skewed.
For an index to all posts in this series, see the first post here.
2019-11-29