Snowflake for SQL Server users – Part 4 – T-Shirt Sizing

In my last post in this series, I mentioned that the Compute layer of Snowflake is basically made up of a series of Virtual Warehouses (VWs). Each VW is an MPP (massively parallel processing) compute cluster that can comprise one or more compute nodes.

The number of nodes in the compute cluster is called its "size", and the sizing options are designed to resemble T-Shirt sizing, as you can see in the main image above.

Note that XS (extra small) is the smallest VW size. Running an XS for one hour consumes one Snowflake credit. In Australia, on Azure, that's currently a little over $2 USD.

The number of credits consumed per hour by a VW is directly related to the number of compute nodes it contains, and each step up in size doubles the node count. So a 2XL VW has 32 nodes and consumes 32 credits per hour.
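As a rough sketch of what that looks like in SQL (the warehouse name here is just an example), you create a warehouse at one size and can later resize it:

CREATE WAREHOUSE demo_wh WITH WAREHOUSE_SIZE = 'XSMALL';  -- XS: 1 node, 1 credit per hour
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'XXLARGE';   -- 2XL: 32 nodes, 32 credits per hour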

Threads

The other important aspect of this sizing is the number of threads. Currently, there are 8 threads per compute node.

So the number of concurrent threads for a VW goes from 8 at the XS level, up to 1024 at the 4XL level.

I'll talk more about threads later, but they become important when you're trying to get files loading in parallel efficiently, and when you want significant concurrency in query execution.
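As a rough back-of-the-envelope view (assuming the node count doubles at each size, as above):

XS  =   1 node   x 8 threads =    8 concurrent threads
M   =   4 nodes  x 8 threads =   32 concurrent threads
4XL = 128 nodes  x 8 threads = 1024 concurrent threads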

 

For an index to all posts in this series, see the first post here.

SDU_FileSplit – Free utility for splitting CSV and other text files in Windows

When I was doing some Snowflake training recently, one of the students in the class asked what utility they should use on Windows for splitting a large file into sections. They wanted to split files for better bulk loading performance, so that all available threads could be used.

On Linux systems, the split command works fine, but the best that most people came up with on Windows was to use PowerShell. That's a fine answer for some people, but not for everyone.

Because the answers were limited, and people struggled to find a CSV splitter, I decided to fix that and create a simple utility targeted at exactly this use case.

SDU_FileSplit

SDU_FileSplit is a brand new command line utility that you can use to split text files (including delimited files).

Usage is as follows:

SDU_FileSplit.exe <InputFilePath> <MaximumLinesPerFile> <HeaderLinesToRepeat> <OutputFolder> <Overwrite> [<OutputFilenameBase>]

The required parameters are positional and are as follows:

<InputFilePath> is the full path to the input file including file extension
<MaximumLinesPerFile> is the maximum number of lines in each output file
<HeaderLinesToRepeat> is the number of header lines to repeat in each file (0 for none, 1 to 10 allowed)
<OutputFolder> is the output folder for the files (it needs to already exist)
<Overwrite> is Y or N to indicate if existing output files should be overwritten

There is one additional optional parameter:

<OutputFilenameBase> is the beginning of the output file name – default is the same as the input file name

Example Usage

Let's take a look at an example. I have a file called Cinemas.csv in the C:\Temp folder. It contains some details of just over 2000 cinemas.

I'll then execute the following command:
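Based on the parameter values described below, the command looks something like this (the output folder path is just an example):

SDU_FileSplit.exe Cinemas.csv 200 1 C:\Temp\Output Y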

This says to split the file Cinemas.csv that's in the current folder, with a maximum of 200 rows per output file.

The CSV has a single header row, and I've chosen to copy that into each output file. That way, we're not just splitting the data rows; each output file gets its own header.

We've then provided the output folder, and said Y to overwrite the output files if they already exist.

And in the blink of an eye, we have the required output files, and as a bonus, they're all UTF-8 encoded.

Downloading SDU_FileSplit

It's easy to get the tool and start using it. You can download a zip file containing it here.

Just download and unzip it. As long as you have .NET Framework 2.0 or later (that's pretty much every Windows system), you should have all the required prerequisites.

I hope you find it useful.

Disclaimer

It's 100% free for you to download and use. I think it's pretty good, but as with most free things, I can't guarantee that. You'll need to make your own decision about whether it's suitable for you.

 

Snowflake for SQL Server users – Part 3 – Core Architecture

The first thing to understand about Snowflake is that it takes a very layered approach, and the layers are quite independent, including in how they scale.

Cloud Provider Services

The lowest level isn't part of Snowflake; it's the set of services provided by the underlying cloud provider. As a cloud-native application, Snowflake is designed to use services from the cloud provider it's deployed to, rather than providing all of those services itself. At present, that means AWS or Microsoft Azure. Deployment on Google Cloud Platform is in preview at this time.

Each deployment of Snowflake includes the upper three layers that I've shown in the main image above.

Storage Layer

This layer is exactly what it says. It uses storage from the cloud provider to persist anything that needs to be persisted. That includes the obvious things like databases, schemas, and tables, but it also includes less obvious things like caches for the results of queries that have been executed.

Storage is priced separately from compute. While the prices differ across cloud providers and locations, the Snowflake people said they aren't aiming to make much of a margin on the storage costs. They're pretty much passing on the cloud provider's cost.

In addition, when you're staging data (which I'll discuss later), you can choose to use Snowflake-managed storage or your own storage accounts with a cloud provider to hold that staged data.
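As a rough sketch of that choice (the stage names and the Azure URL here are placeholders, not real values):

CREATE STAGE my_internal_stage;    -- staged data lives in Snowflake-managed storage

CREATE STAGE my_azure_stage        -- staged data lives in your own Azure storage account
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/staged/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>');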

Compute Layer

This is the heart of where Snowflake makes its income. The compute layer is made up of a series of virtual clusters that provide the compute power to work on the data. Importantly, each of those virtual clusters (called Virtual Warehouses) can independently access the shared underlying storage.

Compute is charged by consuming what are called "credits". I'll write more about those soon. What is interesting is that you only pay for compute while a virtual warehouse is running.

While there are some blocking commands, the idea is that you should do most of those in staging areas, to keep your shared storage accessible in a highly concurrent way. The aim is to have one virtual warehouse happily querying the data while another is in the middle of bulk loading other data. And the virtual warehouses can each be a different size.
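A minimal sketch of that pattern (the warehouse names are just examples): a small warehouse dedicated to bulk loading and a larger one for queries, both working against the same shared storage.

CREATE WAREHOUSE load_wh  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;
CREATE WAREHOUSE query_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

With AUTO_SUSPEND set, each warehouse stops consuming credits shortly after it goes idle, which fits the pay-only-while-running model described above.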

Global Services Layer

This is the layer where all the metadata and control lives. It's also where transactions are managed, and where security and data sharing details live. I'll describe each of its functions in future posts.

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 2 – Cloud First Design

In recent years, I've done a lot of work in software houses (Microsoft calls them ISVs or Independent Software Vendors). Many of these software houses have worked out that they won't be able to just keep selling their on-premises applications because their customers are asking for cloud-based solutions.

And more importantly, the customers want the software houses to manage the applications, rather than having to manage the applications themselves. So, many of the software houses start trying to turn their on-premises applications into Software as a Service (SaaS) applications.

And the outcome? It's almost always really, really poor.

Cloud First

The real problem is that those applications haven't been written with a cloud-first mentality. They really are often the "square peg in a round hole". What often happens is that the cloud-based versions of the applications have a large number of limitations compared to the on-premises versions, because recreating everything in the cloud (when parts of it were never designed for the cloud) is often nearly impossible.

I'm a big fan of Azure SQL Database, and they've done a great job on it (way better than almost any other application), but it's still quite a distance short of where we already were with SQL Server on-premises. I wish the marketing for the product would focus on what is there (i.e. a truly amazing product), but the discussion always seems to be around what's missing, and how it compares to the on-premises product. In fact, I'm sure the entire reason that the Managed Instance version of Azure SQL Database appeared was to address some of those shortcomings.

If the first time you'd seen SQL Server was as Azure SQL Database, you'd come away saying how amazing it is. But if you've come from the on-premises product, chances are that you're already using something that isn't there.

Nice to have a blank slate

Even if you were brand new to Azure SQL Database, though, you'd find aspects of how the product is designed that are based on thinking from decades ago, designed for the systems that were available back then. It's very hard to make major changes when you're bound by the need to keep backwards compatibility.

One key advantage that the team building Snowflake had was a clean slate, where they could design a product that targets cloud provider services under the covers, instead of on-premises physical devices and operating systems.

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 1 – Why Snowflake?

A few months back, I started noticing that many of our clients had started to mention Snowflake.

In recent years, I've been in lots of planning and architectural meetings where there was already a presumption that AWS was being used rather than Azure. I put that down to a great selling job by the AWS people who got corporate IT folk locked into large enterprise agreements early. And so no matter what the technical question is, the answer will be something that runs on AWS.

I still think that Azure offers a far stronger cloud story than AWS but I'm looking at the whole end-to-end cloud-based story, particularly with platform services. I just see a stronger, more secure, and better envisaged story in Azure.

Cloud Transformation

A big part of this issue is that many organizations that I go into say they're making a "cloud transformation", yet what they're actually doing is just moving a whole lot of virtual machines into a cloud hosting center.

That's just re-hosting; it's not making a cloud transformation.

For cloud-hosting providers, it's also a no-win game. When all you're doing is re-hosting VMs, you're mostly competing on price, and that's a race to the bottom that no-one really wins.

Data Warehouses

One bright note on the horizon, though, is data warehouses. Almost all these corporates seem to get the idea that they want a cloud data warehouse as a service, not just a bunch of re-hosted virtual machines.

For customers who've already made the decision to use AWS (before we come into the meeting), when they look around for cloud data warehouses, Snowflake is one that's often chosen quickly.

A bonus for me is that it's now available on both AWS and Azure.

I've been spending a lot of time lately digging into Snowflake, and in this series of posts, I'll look at Snowflake from an existing SQL Server user's perspective.

Posts in the series

Part 1 – Why Snowflake? 
Part 2 – Cloud First Design
Part 3 – Core Architecture
Part 4 – T-Shirt Sizing
Part 5 – Editions and Security
Part 6 – Role Based Security
Part 7 – Authentication
Part 8 – Case-Sensitivity 
Part 9 – Stages
Part 10 – File formats
Part 11 – GET, PUT, SnowSQL 
Part 12 – Parallelism when loading files 
Part 13 – Programmable objects
Part 14 – Internal storage and micropartitions
Part 15 – Table types 
Part 16 – Primary and Foreign Keys
Part 17 – Clustering Keys for Tables
Part 18 – Time Travel
Part 19 – Fail Safe
Part 20 – Encryption

And more coming…