Snowflake for SQL Server Users – Part 14 – Internal storage in Micropartitions

If you want to learn a lot about how Snowflake works internally, it's worth spending some time reading the ACM whitepaper that described the service.

It covers the overall structure of the service, how data is stored, and how the data within the service is encrypted. (I'll talk more about the management of encryption keys in a later blog post.)

Columnstores

Columnstores have been familiar to SQL Server users since SQL Server 2012, and it's no surprise that the storage within Snowflake is essentially columnstore based.

Columnstores are great for adding data to, and excellent for reading large amounts of data in bulk. What they aren't great at is being updated. Clustered columnstore indexes in SQL Server are updatable, but only because they have an associated rowstore that's used as a delta store.

Micropartitions

In Snowflake, the columnstores are formed by a series of what are called micropartitions. Each of these is a contiguous storage location that holds up to 16 MB of compressed data (50 to 500 MB of uncompressed data), but importantly, the micropartitions are immutable, i.e. once they are created, they are never modified.

The metadata for the micropartitions records the range of values for each of the columns that are stored in the micropartition, the number of distinct values, and a few other properties.

Tables get partitioned automatically during insert and load operations, based upon the order of the incoming data.

Interestingly, the file extension for these is FDN, which is short for flocon de neige, the French for snowflake.

Query performance against the columnstores is highly dependent upon being able to determine which micropartitions need to be read to satisfy a particular query. In Snowflake, the term used to describe this partition elimination is pruning.
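To make that concrete, here's a hedged sketch (hypothetical table and column names) of a query that can be satisfied by reading only some of the micropartitions, plus the function that reports how well a column's values are spread across them:

-- only micropartitions whose metadata shows an overlapping sale_date range get read
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2020-07-01' AND '2020-07-31';

-- inspect how the values of a column are distributed across micropartitions
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');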

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 13 – Programmable objects

Similar to SQL Server, Snowflake has ways of creating programmable objects, but the way they work and the way they are created are quite different.

Functions

Functions are the most similar. You can create them in two ways:

  • JavaScript
  • SQL

I like the idea that you can choose which language to write code in, and that they both end up extending the Snowflake SQL language pretty much the same. Some code is better written in SQL, and other code is better written in a higher-level language.

Functions are called as you'd expect, anywhere an expression can be used.

Both scalar functions and table-valued functions are supported. Scalar functions must return a value.
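As a hedged sketch (the names are hypothetical, not from the Snowflake docs), here's a scalar function written each way:

-- a SQL scalar function
CREATE OR REPLACE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS
$$
    pi() * radius * radius
$$;

-- a JavaScript scalar function (parameters are referenced in upper case in the body)
CREATE OR REPLACE FUNCTION cube_it(val FLOAT)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS
$$
    return VAL * VAL * VAL;
$$;

SELECT area_of_circle(2.0), cube_it(3.0);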

Stored Procedures

You can write stored procedures but, curiously, you can only do that in JavaScript.

I have to say I've never worked with a SQL database engine before that supports stored procedures but won't let you write stored procedures in SQL. I think this is quite a shortcoming in the product.

Stored procedures are called using the CALL statement (not EXEC as in SQL Server). Another curious aspect is that even though the stored procedures support a return value, the syntax for calling stored procedures via CALL doesn't support retrieving a return value. I have to say, that's quite bizarre.

You can pass values back from stored procedures by using temporary tables. Or, if the returned data is small enough, you might be able to stuff it into a VARIANT value and return that.

Stored procedures can be nested.
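As a hedged sketch (hypothetical table and procedure names), a JavaScript stored procedure and the CALL that invokes it look like this:

CREATE OR REPLACE PROCEDURE remove_old_staging_rows(cutoff_days FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
    // parameters are referenced in upper case inside the JavaScript body
    var sql = "DELETE FROM staging_events WHERE loaded_on < DATEADD(day, -"
              + CUTOFF_DAYS + ", CURRENT_TIMESTAMP())";
    snowflake.execute({ sqlText: sql });
    return "Deleted rows older than " + CUTOFF_DAYS + " days";
$$;

CALL remove_old_staging_rows(30);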

Triggers

There is currently no concept of a trigger in Snowflake. That means no DML (INSERT/UPDATE/DELETE) triggers and no DDL (CREATE/ALTER/DROP/LOGON) triggers.

As triggers are often a necessary evil in some applications, this again is a significant shortcoming of the product at this time.

 

For an index to all posts in this series, see the first post here.

 

Snowflake for SQL Server Users – Part 12 – Parallelism when loading data from files into tables

When you are loading data into Snowflake, it's really important to achieve the maximum parallelism that you can. You want as many files loading in parallel as you have processor threads available. I mentioned in earlier posts that the number of servers that you have, and the size of each of those servers, determine the number of processor threads that are available to you from your virtual warehouses.

It would be pointless to have 32 processor threads waiting to load your data, only to provide Snowflake with one large file to load.

Instead, you should consider having at least as many smaller files to load as you have available processor threads. This means that you should favour many small files, rather than a smaller number of larger files.

Breaking up large files

This means that it can be desirable to break up existing large files. When I was doing Snowflake training, a number of people asked about good options for breaking up large text files. We couldn't find a good option so I wrote one.

You can find details of SDU_FileSplit here. As well as breaking up the files, it can transfer the header rows into the split files and more.

On Linux, you could just use the split command line utility.

File Sizes

In terms of how large the files should be, the Snowflake documentation recommends 10MB to 100MB of compressed data. (Most people load zip files).

That also means that if your files are much smaller, you should aggregate them before loading them, to get them into the target file size.

 

For an index to all posts in this series, see the first post here.

 

Snowflake for SQL Server Users – Part 11 – GET, PUT and SnowSQL for working with local files

In a previous post, I talked about stages. They are internal or external cloud storage locations; the COPY command copies data from a stage into a database table, or exports data from a database table into a stage.

Now if you are using external stages, they're just standard storage accounts in AWS (S3), Azure Storage, or Google (GCS). You can use whatever tools you want to get files from other locations (like your local file system) to/from these accounts.

But if you want to get data in or out of the internal stages, you need to talk to the Snowflake API. While you could write your own code to do that, the easiest way to do that is by using a tool called SnowSQL.

SnowSQL

Snowflake has a number of connectors. SnowSQL runs on 64-bit versions of Windows, macOS, and Linux. It's built using the Snowflake Connector for Python, and that connector is just straight Python code. On Windows, you need Python 2.7.9 or higher (for Python v2) or 3.5.0 or higher (for Python v3).

SnowSQL is a command line client that you can run interactively as a shell, or run in batch mode. Commands can be supplied in a file via the -f parameter (common on Windows) or redirected into it via stdin (common on Linux and macOS).

Using SnowSQL is very much like using OSQL or SQLCMD. Like SQLCMD, it has a rich command language of its own, with variable substitution, auto-complete, and command line history.

SnowSQL Commands

You can run SQL commands through SnowSQL, but most people would use other tools for that. The commands that SnowSQL is mostly used for are GET and PUT. These are not SQL commands; they're SnowSQL commands.

GET is used to retrieve files from the internal stages.

PUT is used to upload files into the internal stages.

Other useful (and related) commands are:

LIST is used to list the files in either internal or external stages.

REMOVE is used to delete files from the internal stages.
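As a hedged sketch (the paths and stage folder are hypothetical), a SnowSQL session might look like this:

-- upload a local file to the user stage (auto-compressed to .gz)
PUT file:///tmp/load/customers.csv @~/loading AUTO_COMPRESS=TRUE;
-- check that it arrived
LIST @~/loading;
-- download it again
GET @~/loading/customers.csv.gz file:///tmp/downloads/;
-- clean up
REMOVE @~/loading/customers.csv.gz;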

 

For an index to all posts in this series, see the first post here.


Snowflake for SQL Server Users – Part 10 – Working with file formats

One thing that I quite like about Snowflake is the way it cleanly works with a wide variety of file formats.

Obviously this changes over time but at the time of writing, you could COPY from the following source file formats:

  • CSV
  • JSON
  • AVRO
  • ORC
  • PARQUET
  • XML

There are also quite a number of options for configuring how these are used. Apart from the obvious options like record and field delimiters, skipping rows, etc, one of the most important of these options is compression. You can currently choose these options for compression:

  • AUTO
  • GZIP
  • BZ2
  • BROTLI
  • ZSTD
  • DEFLATE
  • RAW_DEFLATE
  • NONE

I had good outcomes using the AUTO setting. Mostly I'm using zip'd input files.

There are a few options for transforming data while loading it, but they are very limited. For example, you can alias a column or change its position.
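For example, here's a hedged sketch (hypothetical names) that swaps the order of two incoming columns while loading them into named columns of the target table:

COPY INTO customers (customer_id, customer_name)
FROM (SELECT $2, $1 FROM @my_stage/customers.csv.gz)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);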

Named File Format

Similar to the way you can avoid specifying full file locations every time by creating an external stage, you can avoid specifying all your file format details every time by creating a named file format.

No surprise, that's done with the CREATE FILE FORMAT command.
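A hedged sketch (hypothetical names) of creating a named file format and then using it in a COPY:

CREATE OR REPLACE FILE FORMAT standard_csv
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  COMPRESSION = AUTO;

COPY INTO customers
FROM @my_stage/customers.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'standard_csv');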

Data Export

When you export data, you again use the COPY command, but currently there are fewer file format options available than there are for input. You can use:

  • CSV
  • JSON
  • PARQUET

I would be surprised if the others aren't added soon.

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 9 – Stages

Snowflake has the normal options in its SQL language for using INSERT statements to add data to a table. But it also has bulk options for rapidly adding a lot of data to a table via the COPY command.  The same command can be used for bulk export as well.

A curious aspect of this, though, is that because it's a cloud-only database system, you can't just use COPY to get data to or from a local file system. COPY works to and from what's called a stage.

A stage is a cloud-based storage location that's used as a staging area for data.

There are two basic types of stages: the ones provided within Snowflake itself, and the ones that are located in public cloud providers. For the public cloud providers, you can currently choose one of:

  • AWS S3 bucket
  • Azure storage location
  • Google cloud storage (GCS)

When you work with these, you can specify them as an external location (where you put the full address), or you can create an external stage. An external stage is just a name that you give to details of the location. That lets you just use a name, and not have to repeat details of where the location is, all through your code.
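As a hedged sketch (the bucket, credentials, and names are all hypothetical), creating and then using an external stage over an S3 bucket looks like this:

CREATE OR REPLACE STAGE sales_stage
  URL = 's3://my-example-bucket/sales/'
  CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>')
  FILE_FORMAT = (TYPE = CSV);

COPY INTO sales FROM @sales_stage;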

The stages provided by Snowflake are as follows (referencing each type is sketched after the list):

  • Each user has a default stage when they are connected. A user stage is referred to by the name @~ and is automatically created for you.
  • Each table in the database also has a default stage associated with the name of the table. You refer to a table stage as @%tablename and again, these are automatically created for you.
  • You can also ask Snowflake to create a stage for you. You create an internal named stage by using the CREATE STAGE command. These are normally permanent but you can also make them session-temporary by adding the word TEMPORARY when creating them.
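Referencing each of these stage types looks roughly like this (a hedged sketch with hypothetical names):

-- the current user's stage
PUT file:///tmp/orders.csv @~;
-- the table stage for the ORDERS table
PUT file:///tmp/orders.csv @%orders;
-- an internal named stage
CREATE STAGE loading_dock;
PUT file:///tmp/orders.csv @loading_dock;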

Next time, we'll look at the types of files that Snowflake can import/export.

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 8 – Case Sensitivity

There are many things I like about Snowflake. How they handle case and collations is not one of them.

There are currently no rich options for handling case like you have in SQL Server, with detailed options around both collations, and case sensitivity.

I've previously written about how I think that case-sensitivity is a pox on computing. I see absolutely no value in case-sensitivity in business applications, and a significant downside.

Case preservation is a different thing. I expect systems to remember the case that I define things with, but 99.9% of the time, I want to search for them without caring about case. All that case-sensitivity provides is the ability to have two objects in the same scope whose names differ only in capitalization. That's usually the result of lazy coding, and almost never a good idea.

Snowflake is basically case-sensitive. I wish it wasn't. There are, however, some workarounds.

Object Names

To get around some of the problems that case-sensitivity causes, Snowflake automatically upper-cases object names when you define or use them. By default, Snowflake treats the objects Customers, customers, and CUSTOMERS as the same object. In fact, if you execute a statement like:

CREATE TABLE Customers

what it will create is a table called CUSTOMERS.  If you execute

SELECT 2 AS Value;

you'll get back the value 2 in a column called VALUE, not the column name that you asked for.

That breaks my basic wish (in any language) for case-preservation. Most objects that you see in almost every Snowflake presentation have names that are all capitalized.

You can, however, get around this by quoting each name with double-quotes.

CREATE TABLE "Customers"

And then you need to do that for every table, every column, every object, etc. from then on. If you execute:

SELECT 2 AS "Value";

you'll get the value 2 with the column name that you're after.

Case-Related Comparisons

To deal with the fact that most people won't want to compare strings in a case-sensitive way, they've created some additional operators to handle case. For example, if you use

WHERE "CustomerName" ILIKE 'Fred%'

you get a case-insensitive version of LIKE.
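ILIKE also works without wildcards as a simple case-insensitive equality test, and the traditional UPPER() comparison works too. A small sketch (hypothetical names):

-- exact match, ignoring case
SELECT * FROM "Customers" WHERE "CustomerName" ILIKE 'fred smith';
-- the traditional workaround
SELECT * FROM "Customers" WHERE UPPER("CustomerName") = UPPER('Fred Smith');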

Added To My Wish List

The Snowflake people really need to fix how they handle case. You can see from the requests in their user forums that I'm not the only one that thinks so.

This aspect of the product feels very Oracle-like, and also feels like being back in the 1960s. Humans don't like screaming snake case. I really hope they will fix it soon, as it's currently one of the weakest aspects of the product.

 

For an index to all posts in this series, see the first post here.


Snowflake for SQL Server users – Part 7 – Authentication

Authentication in Snowflake is based around user identities. That's very similar to what we have today with SQL Server authentication (i.e. SQL Server logs you on), as opposed to Windows Active Directory authentication (i.e. Windows AD logs you on), or  Azure Active Directory authentication (i.e. AAD logs you on).

I can create users if I'm a member of either the SECURITYADMIN or ACCOUNTADMIN roles.

Snowflake has a single specific password policy:

  • at least 8 chars long
  • at least 1 digit
  • at least 1 uppercase and 1 lowercase character

Similar to SQL Server logins, I can force a user to change password the first time.

I'm not a fan of these types of password policies, and NIST recommends against them. I would like to see Snowflake (and SQL Server) have a more up-to-date password policy in place.

Users can be disabled or enabled via ALTER USER, and you can set a default warehouse and role for a user. DESCRIBE USER and SHOW USERS are useful commands for checking who is configured.
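A hedged sketch (hypothetical names) of what that user management looks like:

CREATE USER mary
  PASSWORD = 'Str0ngPassw0rd!'
  DEFAULT_WAREHOUSE = reporting_wh
  DEFAULT_ROLE = analyst
  MUST_CHANGE_PASSWORD = TRUE;

-- disable (and later re-enable) the user
ALTER USER mary SET DISABLED = TRUE;

DESCRIBE USER mary;
SHOW USERS;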

MFA

Multi-factor authentication is also supported. You should have at least your administrators enrolled for MFA.

Enabling MFA currently means using the Duo application on your phone or device for the extra authentication. To be really good citizens within Azure and within Google Cloud, I'd like to see them also support Microsoft Authenticator and Google Authenticator as well.

Integration/Federation

OAuth 2.0 is supported, and you can also achieve a degree of single sign-on by implementing federation. Snowflake seem to work closely with Okta for single sign-on and MFA.

While Snowflake list Azure Active Directory as a potential integration, it really needs a lot more work to integrate properly with AAD. In particular, there is no concept of granting access to Snowflake or role membership within Snowflake to groups in AAD.

Because there is no concept of group membership, the DCL commands are just GRANT and REVOKE; there is no concept of a DENY, which you can need when working with groups.

There is also no concept of a logon trigger.

My Authentication Wish List

I really hope they will address concepts like groups and logon/logoff triggers fairly quickly. I do not like to see database administrators getting involved with individual user access. They should manage at a group level, and someone else in the organization should be deciding who is or isn't a member of a group.

Better still would be to see a detailed integration with Azure Active Directory (AAD) and let the directory service do the hard work. For example, there's currently no way to force admins to use MFA, or to implement detailed MFA-related policies for other users, and so on.

I heard they are looking to implement AAD integration via Okta. Direct integration would be my preference. Full integration with AAD would provide what's needed for this and much, much more.

 

For an index to all posts in this series, see the first post here.

 

Snowflake for SQL Server users – Part 6 – Role Based Security

Security in Snowflake is quite different to what's implemented in SQL Server.

In SQL Server, objects have owners that can either be specific principals or, as happens when a member of the dbo role creates an object, the dbo role itself. In Snowflake, objects in the database also have owners, but the owners can't be users; they are always roles.

When you create a Snowflake account, there are four pre-defined roles:

  • ACCOUNTADMIN
  • SECURITYADMIN
  • SYSADMIN
  • PUBLIC

The ACCOUNTADMIN role is the closest to what we have in SQL Server with the sysadmin role. The SYSADMIN role in Snowflake is much more limited. For example, if a user in the SALESPERSON role creates a table, a SYSADMIN role member cannot even see that table, unless the SYSADMIN role is added to the SALESPERSON role.

Now, it's recommended that the SYSADMIN role is added to all other roles to avoid this issue, but there's no concept of a DDL trigger that would allow you to enforce that.

One at a time

Another very curious concept is that your security context can only be in one role at a time. That's quite unlike Windows and SQL Server, where you can do whatever any of the roles you are a member of can do.

There is a USE DATABASE command like we have in SQL Server, but there is also USE WAREHOUSE to decide which virtual warehouse will be used for compute costs, and USE ROLE to decide which role you are operating in. So commands that you execute always run in a context made up of (as sketched after this list):

  • Which database
  • Which warehouse
  • Which role
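A hedged sketch (hypothetical names) of setting that context before running anything:

USE ROLE analyst;
USE WAREHOUSE reporting_wh;
USE DATABASE sales;

SELECT COUNT(*) FROM public.orders;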

DDL commands look odd

While the concept of granting a permission to a role is familiar, and works much the same way, they also use the words GRANT and REVOKE for role membership.

I find that really odd. In my mind, roles are things that have members, and the DDL should relate to adding and removing members from roles.  Roles can be nested, and so if, for example, you want to add the role SALESPEOPLE to the role EMPLOYEES, you say:

GRANT ROLE EMPLOYEES TO ROLE SALESPEOPLE;

So I can't say that I like the way that they've implemented the DDL commands for roles. I think there should be separate DDL for granting and revoking permissions from roles, and for adding and removing members of roles.
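The same verbs are also used to give a role to a user. A hedged sketch (hypothetical names) showing the two quite different things GRANT ends up doing:

-- granting a permission to a role
GRANT SELECT ON TABLE sales.public.orders TO ROLE salespeople;
-- effectively adding a user as a "member" of a role, with the same verb
GRANT ROLE salespeople TO USER mary;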

Schemas are also a bit odd

Another thing that I find odd is that when you grant a role SELECT permission on a schema, it only applies to objects that are currently in the schema. So it's really just a shortcut that avoids granting the permission on every existing object in the schema. Normally when I grant permissions at the schema level, it's for all objects currently in the schema, and all objects that will ever be in it.

Curiously, Snowflake has an extra permission for that, where you need to separately tell it "oh, and all future objects too".
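A hedged sketch (hypothetical names) of granting on existing objects and then, separately, on future ones:

-- objects that exist in the schema right now
GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;
-- and, separately, objects created in the schema later
GRANT SELECT ON FUTURE TABLES IN SCHEMA sales.public TO ROLE analyst;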

Transferring Ownership

To transfer ownership of an object, you need to follow these steps (the ownership change itself is sketched after the list):

  • Remove all permission grants
  • Change ownership
  • Re-establish all permission grants
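The ownership change itself is done with GRANT OWNERSHIP. A hedged sketch (hypothetical names):

-- assumes any outstanding grants on the object have already been removed, as above
GRANT OWNERSHIP ON TABLE sales.public.orders TO ROLE sales_admin;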

If you drop a role, the role that you are currently executing in when you drop it becomes the owning role for all the objects that were owned by the dropped role.

Checking current grants

You can check the grants that have been made by:

SHOW GRANTS TO ROLE <role_name>;
SHOW GRANTS TO USER <user_name>;

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 5 – Editions and Security Features

Like most products, Snowflake comes in a number of editions. At the time of writing, they were Standard, Premier, Enterprise, Business Critical, and Virtual Private Snowflake (VPS). (Keep in mind that these could change at any time, so check the Snowflake site for the current options.)

The first thing I need to say is that I really like the way that most of the SQL code surface is pretty much identical across editions. I wish that coverage was complete, but it currently doesn't include materialized views.

Note: That was a great change when the SQL Server team did the same back in SQL Server 2016 SP1.

There is no free edition like we have with SQL Server Express. Similarly, there's no free option for developers like we have with SQL Server Developer Edition. That's not surprising though, as they aren't the cloud provider; they are purchasing services from cloud providers. I find the Standard edition pretty low cost anyway: you pay a fairly low amount for storage, and you only pay for compute when you use it. So that's not too bad.

Differences in Editions

The main difference between Standard and Premier is that the latter comes with premier support. So that's not a bad distinction between, say, development tasks and production tasks. I'd rather see that as just a single edition, with support as an optional extra across all editions.

Snowflake has a feature called Time Travel. This allows you to see what data looked like at earlier times. It's a bit like temporal tables, but also quite different. I'll talk more about it in another post.

Standard and Premier both have one day of time travel though, and Enterprise edition takes you up to 90 days. I like to see that sort of feature used as a differentiator between editions. It mostly wouldn't require code changes when working with different editions.

Business Critical basically introduces more security. It adds HIPAA and PCI compliance, and the ability to use customer-managed encryption keys. I can't say that I love the idea of core compliance as a distinction between editions. Everyone's data is important to them. Customer-managed keys are a good edition differentiator though.

Snowflake data gets encrypted anyway, and with a key that changes each month (for new data). But on lower editions, it doesn't get re-keyed. What Business Critical also adds is annual key rotation. Data that's a year old gets decrypted and re-encrypted with a new key.

VPS or Virtual Private Snowflake is for people who can't tolerate the idea of any sort of shared Snowflake infrastructure. The Snowflake team do a completely separate deployment of the whole Snowflake stack, just for each customer. It's super expensive (I heard it starts at over $50M AUD) and so I can't imagine too many customers using it, but I'm sure there will be a few, including right here in Australia.

I heard that VPS was only available on AWS at the time of writing, but I'm sure that will change. And I'm guessing if you front up with $50M+, and say you want it on Azure, it's unlikely they'd say no.

 

For an index to all posts in this series, see the first post here.