Snowflake for SQL Server users – Part 6 – Role Based Security

Security in Snowflake is quite different to what's implemented in SQL Server.

In SQL Server, objects have owners that can be either specific users or roles: for example, when someone in the dbo role creates an object, it ends up owned by dbo. In Snowflake, objects in the database also have owners, but the owners can't be users; they are always roles.

Every Snowflake account comes with four pre-defined roles:

  • ACCOUNTADMIN
  • SECURITYADMIN
  • SYSADMIN
  • PUBLIC

The ACCOUNTADMIN role is the closest to what we have in SQL Server with the sysadmin role. The SYSADMIN role in Snowflake is much more limited. For example, if a user in the SALESPERSON role creates a table, a SYSADMIN role member cannot even see that table, unless the SYSADMIN role is added to the SALESPERSON role.

Now it's recommended that the SYSADMIN role is added to all other roles to avoid this issue, but there's no concept of a DDL trigger that would let you enforce that.
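In DDL terms, that "adding" is done with GRANT (more on that oddity below). Here's a minimal sketch, assuming a custom role named SALESPERSON:

USE ROLE SECURITYADMIN;

-- SYSADMIN now inherits everything SALESPERSON can do,
-- including access to the objects that SALESPERSON owns
GRANT ROLE SALESPERSON TO ROLE SYSADMIN;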

One at a time

Another very curious concept is that your security context can only be in one role at a time. That's quite unlike Windows and SQL Server, where you can do whatever any of the roles that you're in can do.

There is a USE DATABASE command like we have in SQL Server, but there is also USE WAREHOUSE to decide which virtual warehouse will be used for compute costs, and USE ROLE to decide which role you are operating in. So commands that you execute are always in a context of:

  • Which database
  • Which warehouse
  • Which role
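Setting that context up is just three statements. As a minimal sketch (the role, warehouse, and database names here are hypothetical):

USE ROLE SALESPERSON;      -- which role you're operating in
USE WAREHOUSE SALES_WH;    -- which virtual warehouse wears the compute costs
USE DATABASE SALESDB;      -- which database unqualified names resolve against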

DDL commands look odd

While the concept of granting a permission to a role is familiar, and works much the same way, Snowflake also uses the words GRANT and REVOKE for role membership.

I find that really odd. In my mind, roles are things that have members, and the DDL should relate to adding and removing members from roles. Roles can be nested, and so if, for example, you want to add the role SALESPEOPLE to the role EMPLOYEES, you say:

GRANT ROLE EMPLOYEES TO ROLE SALESPEOPLE;

So I can't say that I like the way that they've implemented the DDL commands for roles. I think there should be separate DDL for granting and revoking permissions from roles, and for adding and removing members of roles.
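To see the issue side by side, here's a minimal sketch using the same hypothetical roles (the table name is also hypothetical):

-- granting a permission to a role: familiar
GRANT SELECT ON TABLE SALESDB.PUBLIC.ORDERS TO ROLE SALESPEOPLE;

-- granting role "membership": the same verb, with quite a different meaning
GRANT ROLE EMPLOYEES TO ROLE SALESPEOPLE;
REVOKE ROLE EMPLOYEES FROM ROLE SALESPEOPLE;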

Schemas are also a bit odd

Another thing that I find odd is that when you grant a role permission to SELECT on a schema, it only applies to the objects that are currently in the schema. So it's really just a shortcut for granting the permission on every object currently in the schema. Normally, when I grant permissions at the schema level, it's for all items currently in the schema, and all objects that will ever be in it.

Curiously, Snowflake has an extra grant for that, where you need to separately tell it "oh, and all future objects too".
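As a minimal sketch (schema and role names are hypothetical), the pair looks like this:

-- covers only the tables that exist right now
GRANT SELECT ON ALL TABLES IN SCHEMA SALESDB.PUBLIC TO ROLE SALESPEOPLE;

-- needed separately to cover tables created later
GRANT SELECT ON FUTURE TABLES IN SCHEMA SALESDB.PUBLIC TO ROLE SALESPEOPLE;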

Transferring Ownership

To transfer ownership of an object, you need to do all of the following (a sketch appears after the list):

  • Remove all permission grants
  • Change ownership
  • Re-establish all permission grants
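As a minimal sketch for a single table (SALES_MANAGERS is another hypothetical role):

-- 1. remove the existing grants
REVOKE SELECT ON TABLE SALESDB.PUBLIC.ORDERS FROM ROLE SALESPEOPLE;

-- 2. change the ownership
GRANT OWNERSHIP ON TABLE SALESDB.PUBLIC.ORDERS TO ROLE SALES_MANAGERS;

-- 3. re-establish the grants
GRANT SELECT ON TABLE SALESDB.PUBLIC.ORDERS TO ROLE SALESPEOPLE;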

If you drop a role, the role that you are currently executing in when you issue the DROP becomes the owning role for all of the objects that the dropped role owned.

Checking current grants

You can check the grants that have been made by using:

SHOW GRANTS TO ROLE <role_name>;
SHOW GRANTS TO USER <user_name>;
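For example, with the hypothetical names from earlier:

SHOW GRANTS TO ROLE SALESPEOPLE;
SHOW GRANTS TO USER GREG;    -- GREG being a hypothetical user name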


For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 5 – Editions and Security Features

Like most products, Snowflake comes in a number of editions, and you can see the current editions in the main image above. (Keep in mind that these could change at any time, and you should check their site for the current options.)

The first thing I need to say is that I really like the way that most of the SQL code surface is pretty much identical across editions. I wish that coverage was complete, but it currently doesn't include materialized views.

Note: That was a great change when the SQL Server team did the same back in SQL Server 2016 SP1.

There is no free edition like we have with SQL Server Express. Similarly, there's no free option for developers like we have with SQL Server Developer Edition. That's not surprising though, as they aren't the cloud provider, they are purchasing services from cloud providers. I find that the Standard edition is pretty low cost though: you only pay a fairly low amount for storage, and you only pay for compute when you use it. So that's not too bad.

Differences in Editions

The main difference between Standard and Premier is that the latter comes with premier support. So that's not a bad distinction between, say, development tasks and production tasks. I'd rather see just a single edition though, with support as an optional extra across all editions.

Snowflake has a feature called Time Travel. It allows you to see what data looked like at earlier points in time. It's a bit like temporal tables, but also quite different from them. I'll talk more about it in another post.

Standard and Premier both have one day of time travel though, and Enterprise edition takes you up to 90 days. I like to see that sort of feature used as a differentiator between editions. It mostly wouldn't require code changes when working with different editions.

Business Critical basically introduces more security. It adds HIPAA and PCI compliance, and the ability to use customer-managed encryption keys. I can't say that I love the idea of core compliance as a distinction between editions. Everyone's data is important to them. Customer-managed keys are a good edition differentiator though.

Snowflake data gets encrypted anyway, with a key that changes each month (for new data). On lower editions though, older data doesn't get re-keyed. What Business Critical adds is annual rekeying: data that's a year old gets decrypted and re-encrypted with a new key.

VPS, or Virtual Private Snowflake, is for people who can't tolerate the idea of any sort of shared Snowflake infrastructure. The Snowflake team do a completely separate deployment of the whole Snowflake stack for each customer. It's super expensive (I heard it starts at over $50M AUD), so I can't imagine too many customers using it, but I'm sure there will be a few, including right here in Australia.

I heard that VPS was only available on AWS at the time of writing, but I'm sure that will change. And I'm guessing that if you front up with $50M+ and say you want it on Azure, it's unlikely they'd say no.


For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 4 – T-Shirt Sizing

I mentioned in my last post in this series that the Compute layer of Snowflake is basically made up of a series of Virtual Warehouses (VWs). Each VW is an MPP (massively parallel processing) compute cluster that can comprise one or more compute nodes.

The number of nodes in the compute cluster is called its "size" and the sizing options are made to resemble T-Shirt sizing, as you can see in the main image above.

Note that XS (extra small) is the smallest VW size. Using an XS for one hour consumes one Snowflake credit. In Australia, in Azure, right now, that's a little over $2 USD.

The number of credits used per hour for a VW is directly related to the number of compute nodes that it contains. So a 2XL VW consumes 32 credits per hour.
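Since credits track node counts directly, the full scale is easy to work out from the numbers above: each size doubles the one before it.

  • XS – 1 node, 1 credit per hour
  • S – 2 nodes, 2 credits per hour
  • M – 4 nodes, 4 credits per hour
  • L – 8 nodes, 8 credits per hour
  • XL – 16 nodes, 16 credits per hour
  • 2XL – 32 nodes, 32 credits per hour
  • 3XL – 64 nodes, 64 credits per hour
  • 4XL – 128 nodes, 128 credits per hour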

Threads

The other important aspect of this sizing is the number of threads. Currently, there are 8 threads per compute node.

So the number of concurrent threads for a VW goes from 8 at the XS level, up to 1024 at the 4XL level.

I'll talk more about threads later but they become important when you're trying to get good parallel loading of files happening, and when you want significant concurrency in query execution.


For an index to all posts in this series, see the first post here.

SDU_FileSplit – Free utility for splitting CSV and other text files in Windows

When I was doing some Snowflake training recently, one of the students in the class asked what utility they should use on Windows for splitting a large file into sections. They wanted to split files for better bulk loading performance, to be able to use all available threads.

On Linux systems, the split command works fine but the best that most people came up with on Windows was to use Powershell. That's a fine answer for some people, but not for everyone.

Because the answers were limited, and people struggled to find a CSV splitter, I decided to fix that, and create a simple utility that's targeted at exactly this use case.

SDU_FileSplit

SDU_FileSplit is a brand new command-line utility that you can use to split text files (including delimited files).

Usage is as follows:

SDU_FileSplit.exe <InputFilePath> <MaximumLinesPerFile> <HeaderLinesToRepeat> <OutputFolder> <Overwrite> [<OutputFilenameBase>]

The required parameters are positional and are as follows:

<InputFilePath> is the full path to the input file including file extension
<MaximumLinesPerFile> is the maximum number of lines in the output file
<HeaderLinesToRepeat> is the number of header lines to repeat in each file (0 for none, 1 to 10 allowed)
<OutputFolder> is the output folder for the files (it needs to already exist)
<Overwrite> is Y or N to indicate if existing output files should be overwritten

There is one additional optional parameter:

<OutputFilenameBase> is the beginning of the output file name – default is the same as the input file name

Example Usage

Let's take a look at an example. I have a file called Cinemas.csv in the C:\Temp folder. It contains some details of just over 2000 cinemas.

I'll then execute a command like the following (the output folder here is just an example, and it needs to exist already):
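SDU_FileSplit.exe Cinemas.csv 200 1 C:\Temp\CinemaOutput Y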

This says to split the file Cinemas.csv that's in the current folder, with a maximum of 200 rows per output file.

The CSV has a single header row, and I've chosen to copy that into each output file (the 1 in the command above). That way, we're not just splitting the data rows; each output file gets its own header.

We've then provided the output folder, and also said Y for overwriting the output files if they already exist.

And in the blink of an eye, we have the required output files, and as a bonus, they're all already UTF-8 encoded.

Downloading SDU_FileSplit

It's easy to get the tool and start using it. You can download a zip file containing it here.

Just download and unzip it. As long as you have .NET Framework 2.0 or later (that's pretty much every Windows system), you should have all the required prerequisites.

I hope you find it useful.

Disclaimer

It's 100% free for you to download and use. I think it's pretty good but as with most free things, I can't guarantee that. You make your own decisions if it's suitable for you.


Snowflake for SQL Server users – Part 3 – Core Architecture

The first thing to understand about Snowflake is that it has a very layered approach. And the layers are quite independent, including how they scale.

Cloud Provider Services

The lowest level isn't part of Snowflake; it's the set of services provided by the underlying cloud provider. As a cloud-native application, Snowflake is designed to use services from the cloud provider that it's deployed to, rather than providing all of those services itself. At present, that means AWS or Microsoft Azure. Deployment on Google Cloud Platform is in preview at the time of writing.

Each deployment of Snowflake includes the upper three layers that I've shown in the main image above.

Storage Layer

This layer is exactly what it says. It uses storage from the cloud provider to provide a way to store anything that needs to be persisted. That includes the obvious things like databases, schemas, and tables, but it also includes less obvious things like caches for the results of queries that have been executed.

Storage is priced separately from compute. While the prices differ across cloud providers and locations, the Snowflake people said they aren't aiming to make much of a margin on the storage costs. They're pretty much passing on the cloud provider's cost.

In addition, when you're staging data (which I'll discuss later), you can choose to use Snowflake-managed storage or your own storage accounts with a cloud provider to hold that staged data.

Compute Layer

This is the heart of where Snowflake make their income. The compute layer is made up of a series of virtual clusters that provide the compute power to work on the data. Importantly, each of the virtual clusters (called Virtual Warehouses) can independently access the shared underlying storage.

Compute is charged by consuming what are called "credits". I'll write more about those soon. What is interesting is that you only pay for compute while a virtual warehouse is running.

While there are some blocking commands, the idea is that you should do most of those in staging areas, to keep your shared storage accessible in a highly concurrent way. The aim is to have a virtual warehouse happily querying the data, while another virtual warehouse is in the middle of bulk loading other data. And the virtual warehouses can each be different sizes.

Global Services Layer

This is the layer where all the metadata and control lives. It's also where transactions are managed, and where security and data sharing details live. I'll describe each of its functions in future posts.


For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 2 – Cloud First Design

In recent years, I've done a lot of work in software houses (Microsoft calls them ISVs or Independent Software Vendors). Many of these software houses have worked out that they won't be able to just keep selling their on-premises applications because their customers are asking for cloud-based solutions.

And more importantly, the customers want the software houses to manage the applications rather than themselves. So, many of the software houses start trying to turn their on-premises applications into Software as a Service (SaaS) applications.

And the outcome? Is almost always really, really poor.

Cloud First

The real problem is that those applications haven't been written with a cloud-first mentality. They really are often the "square peg in a round hole". What often happens is that the cloud-based versions of the applications have large numbers of limitations compared to the on-premises versions of the same applications, because recreating everything in the cloud (when there were things that weren't designed for the cloud) is often nearly impossible.

I'm a big fan of Azure SQL Database, and they've done a great job on it (way better than almost any other application), but it's still quite a distance short of where we already were with SQL Server on-premises. I wish the marketing for the product would focus on what is there (i.e. a truly amazing product), but the discussion always seems to be around what's missing, and how it compares to the on-premises product. In fact, I'm sure the entire reason that the Managed Instance version of Azure SQL Database appeared was to address some of those shortcomings.

If the first time you'd seen SQL Server was when you saw Azure SQL Database, you'd come away saying how amazing it is. But if you've come from the on-premises product, chances are that you're already using something that isn't there.

Nice to have a blank slate

Even if you were brand new to Azure SQL Database though, you'd find aspects of how the product is designed that are based on thinking from decades ago, and were designed for systems that were available back then. It's very hard to make major changes when you're bound to trying to keep backwards compatibility.

One key advantage that the team building Snowflake had was a clean slate: they could design a product that targets cloud provider services under the covers, instead of on-premises physical devices and operating systems.


For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 1 – Why Snowflake?

A few months back, I started noticing that many of our clients had started to mention Snowflake.

In recent years, I've been in lots of planning and architectural meetings where there was already a presumption that AWS was being used rather than Azure. I put that down to a great selling job by the AWS people who got corporate IT folk locked into large enterprise agreements early. And so no matter what the technical question is, the answer will be something that runs on AWS.

I still think that Azure offers a far stronger cloud story than AWS but I'm looking at the whole end-to-end cloud-based story, particularly with platform services. I just see a stronger, more secure, and better envisaged story in Azure.

Cloud Transformation

A big part of this issue is that many organizations that I go into say they're making a "cloud transformation", yet what they're actually doing is just moving a whole lot of virtual machines into a cloud hosting center.

That's just re-hosting; it's not making a cloud transformation.

For cloud-hosting providers, it's also a no-win game. When all you're doing is re-hosting VMs, you're mostly competing on price, and that's a race to the bottom that no-one really wins.

Data Warehouses

One bright note on the horizon though is around data warehouses. Almost all these corporates seem to get the idea that they want a cloud data warehouse as a service, not just a bunch of re-hosted virtual machines.

For customers who've already made the decision to use AWS (before we come into the meeting), when they look around for cloud data warehouses, Snowflake is one that's often chosen quickly.

A bonus for me is that it's now available on both AWS and Azure.

I've been spending a lot of time lately digging into Snowflake, and in this series of posts, I'll look at Snowflake from an existing SQL Server user's perspective.

Posts in the series

Part 1 – Why Snowflake? 
Part 2 – Cloud First Design
Part 3 – Core Architecture
Part 4 – T-Shirt Sizing
Part 5 – Editions and Security Features
Part 6 – Role Based Security
Part 7 – Authentication
Part 8 – Case-Sensitivity 
Part 9 – Stages
Part 10 – File formats
Part 11 – GET, PUT, SnowSQL 
Part 12 – Parallelism when loading files 
Part 13 – Programmable objects
Part 14 – Internal storage and micropartitions
Part 15 – Table types 
Part 16 – Primary and Foreign Keys

And more coming…