Snowflake for SQL Server Users - Part 14 - Internal storage in Micropartitions

If you want to learn a lot about how Snowflake works internally, it’s worth spending some time reading the ACM whitepaper that described the service.

It describes the overall structure of the service, and then describes how data is stored, followed by a description of how the encryption of the data within the service is accomplished. (I’ll talk more about the management of encryption keys in a later blog post).

Columnstores

Columnstores will be familiar to SQL Server users ever since 2012, and it’s no surprise that the storage within Snowflake is essentially columnstore based.

Columnstores are great for adding data into, and are excellent for reading large amounts of data in bulk. What they aren’t great for, is being updated. Clustered columnstore indexes in SQL Server are updatable but only because they have an associated rowstore that’s used for the delta store.

Micropartitions

In Snowflake, the columnstores are formed by a series of what are called micropartitions. Each of these is a contiguous storage location that holds up to 16MB of compressed data (50 to 500 MB of uncompressed data), but importantly, the micropartitions are immutable i.e. once they are created, they are never modified.

The metadata for the micropartitions records the range of values for each of the columns that are stored in the micropartition, the number of distinct values, and a few other properties.

Tables get partitioned automatically during insert and load operations, based upon the order of the incoming data.

Interestingly, the file extension for these is FDN which is short for flocon de neige i.e. the French word for Snowflake.

Query performance against the columnstores is highly dependent upon being able to determine which micropartitions need to be read to satisfy a particular query. In Snowflake, the term used to describe this partition elimination is pruning.

For an index to all posts in this series, see the first post here.

2019-11-08