Snowflake for SQL Server users – Part 4 – T-Shirt Sizing

I mentioned in my last post in this series that the Compute layer of Snowflake is basically made up of a series of Virtual Warehouses (VWs). Each VW is an MPP (massively parallel processing) compute cluster that can comprise one or more compute nodes.

The number of nodes in the compute cluster is called its "size" and the sizing options are made to resemble T-Shirt sizing, as you can see in the main image above.

Note that XS (extra small) is the smallest VW size. Using an XS for one hour consumes one Snowflake credit. On Azure in Australia, right now, that's a little over $2 USD per credit.

The number of credits used per hour for a VW is directly related to the number of compute nodes that it contains. So a 2XL VW, with its 32 nodes, consumes 32 credits per hour.
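To make that concrete, here's a rough sketch (the warehouse name is just an example) of creating a warehouse at one size and resizing it later. Because compute is separate from storage, resizing just changes the amount of compute the VW has available:

    CREATE WAREHOUSE ANALYTICS_WH              -- hypothetical name
      WITH WAREHOUSE_SIZE = 'XSMALL';          -- 1 node, so 1 credit per hour while running

    ALTER WAREHOUSE ANALYTICS_WH
      SET WAREHOUSE_SIZE = 'XXLARGE';          -- 2XL: 32 nodes, so 32 credits per hour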

Threads

The other important aspect of this sizing is the number of threads. Currently, there are 8 threads per compute node.

So the number of concurrent threads for a VW goes from 8 at the XS level, up to 1024 at the 4XL level.
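Putting the credits and the threads together, and assuming the node count simply doubles at each size (which is how it worked at the time of writing), you get:

    Size   Nodes   Credits per hour   Threads
    XS     1       1                  8
    S      2       2                  16
    M      4       4                  32
    L      8       8                  64
    XL     16      16                 128
    2XL    32      32                 256
    3XL    64      64                 512
    4XL    128     128                1024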

I'll talk more about threads later, but they become important when you're trying to get files loading in parallel, and when you want significant concurrency in query execution.

 

For an index to all posts in this series, see the first post here.

SDU_FileSplit – Free utility for splitting CSV and other text files in Windows

When I was doing some Snowflake training recently, one of the students in the class asked what utility they should use on Windows for splitting a large file into smaller pieces. They wanted to split files for better bulk-loading performance, so they could use all available threads.

On Linux systems, the split command works fine, but the best that most people came up with on Windows was to use PowerShell. That's a fine answer for some people, but not for everyone.

Because the answers were limited, and people struggled to find a CSV splitter, I decided to fix that, and create a simple utility that's targeted at exactly this use case.

SDU_FileSplit

SDU_FileSplit is a brand new command line utility that you can use to split text files (including delimited files).

Usage is as follows:

SDU_FileSplit.exe <InputFilePath> <MaximumLinesPerFile> <HeaderLinesToRepeat> <OutputFolder> <Overwrite> [<OutputFilenameBase>]

The required parameters are positional and are as follows:

<InputFilePath> is the full path to the input file including file extension
<MaximumLinesPerFile> is the maximum number of lines in each output file
<HeaderLinesToRepeat> is the number of header lines to repeat in each file (0 for none, 1 to 10 allowed)
<OutputFolder> is the output folder for the files (it needs to already exist)
<Overwrite> is Y or N to indicate if existing output files should be overwritten

There is one additional optional parameter:

<OutputFilenameBase> is the beginning of the output file name – default is the same as the input file name

Example Usage

Let's take a look at an example. I have a file called Cinemas.csv in the C:\Temp folder. It contains some details of just over 2000 cinemas:

I'll then execute the following command:
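It looks something like this (the output folder name is just one I've chosen for the example, and the folder needs to exist first):

    SDU_FileSplit.exe C:\Temp\Cinemas.csv 200 1 C:\Temp\CinemaOutput Y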

This says to split the file Cinemas.csv that's in the C:\Temp folder, with a maximum of 200 rows per file.

As you can see in the previous image, the CSV has a single header row. I've chosen to copy that into each output file. That way, we're not just splitting the data rows; each output file also gets its own header.

We've then provided the output folder, and also said Y for overwriting the output files if they already exist.

And in the blink of an eye, we have the required output files, and as a bonus, they're all already UTF-8 encoded:

Downloading SDU_FileSplit

It's easy to get the tool and start using it. You can download a zip file containing it here.

Just download and unzip it. As long as you have .NET Framework 2.0 or later (that's pretty much every Windows system), you should have all the required prerequisites.

I hope you find it useful.

Disclaimer

It's 100% free for you to download and use. I think it's pretty good but, as with most free things, I can't guarantee that. You need to make your own decision about whether it's suitable for you.

 

Snowflake for SQL Server users – Part 3 – Core Architecture

The first thing to understand about Snowflake is that it takes a very layered approach, and the layers are quite independent, including in how they scale.

Cloud Provider Services

The lowest level isn't part of Snowflake; it's the services that are provided by the underlying cloud provider. As a cloud native application, Snowflake is designed to use services from the cloud provider that it's deployed to, rather than providing all of those services itself. At present, that means AWS or Microsoft Azure. Deployment on Google Cloud Platform is in preview at this time.

Each deployment of Snowflake includes the upper three layers that I've shown in the main image above.

Storage Layer

This layer is exactly what it says. It uses storage from the cloud provider to provide a way to store anything that needs to be persisted. That includes the obvious things like databases, schemas, tables, etc., but it also includes less obvious things like caches for the results of queries that have been executed.

Storage is priced separately from compute. While the prices differ across cloud providers and locations, the Snowflake people said they aren't aiming to make much of a margin on the storage costs. They're pretty much passing on the cloud provider's cost.

In addition, when you're staging data (which I'll discuss later), you can choose to use Snowflake managed storage, or your own storage accounts with a cloud provider, to hold that staged data.

Compute Layer

This is the heart of where Snowflake makes its income. The compute layer is made up of a series of virtual clusters that provide the compute power to work on the data. Importantly, each of the virtual clusters (called Virtual Warehouses) can independently access the shared underlying storage.

Compute is charged by consuming what are called "credits". I'll write more about those soon. What is interesting is that you only pay for compute while a virtual warehouse is running.

While there are some blocking commands, the idea is that you should do most of that work in staging areas, so that your shared storage stays accessible in a highly concurrent way. The aim is to have one virtual warehouse happily querying the data while another virtual warehouse is in the middle of bulk loading other data. And the virtual warehouses can each be different sizes.
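As a rough sketch of what that looks like (the warehouse names and settings here are just examples), you might give loading and querying their own warehouses, each of which suspends itself when idle so it stops consuming credits:

    CREATE WAREHOUSE LOAD_WH                   -- hypothetical warehouse for bulk loading
      WITH WAREHOUSE_SIZE = 'LARGE'
           AUTO_SUSPEND = 60                   -- suspend after 60 seconds of inactivity
           AUTO_RESUME = TRUE;

    CREATE WAREHOUSE QUERY_WH                  -- hypothetical warehouse for ad hoc queries
      WITH WAREHOUSE_SIZE = 'XSMALL'
           AUTO_SUSPEND = 300
           AUTO_RESUME = TRUE;

Both warehouses work against the same shared storage, and neither consumes credits while it's suspended.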

Global Services Layer

This is the layer where all the metadata and control lives. It's also where transactions are managed, and where security and data sharing details live. I'll describe each of its functions in future posts.

 

For an index to all posts in this series, see the first post here.

Book Review: Power BI MVP Book

Over the last few months, one of my Kiwi buddies (and fellow member of both the MVP and Microsoft Regional Director programs) Reza Rad has been organizing a bunch of us to write a book that's a collection of ideas from a number of MVPs. It's the Power BI MVP Book.

There are a whole lot of authors from a whole lot of different countries: Reza Rad, Anil Maharjan, Indira Bandari, Liam Bastick, Ken Puls, Jesus Gil, Thomas LeBlanc, Ike Ellis, Matt Allington, Leila Etaati, Markus Ehrenmüller, Ashraf Ghonaim, Eduardo Castro, Manohar Punna, Treb Gatte, Gilbert Quevauvilliers, Michael Johnson, Shree Khanal, Asgeir Gunnarsson, Greg Low, Gogula Aryalingam.

I've done these types of books before, with the pair of SQL Server MVP Deep Dives books. This is a different type of book in that you're not getting a single story throughout. Instead, you're getting a whole set of independent chapters on a variety of topics related to Power BI.

The general idea of these books is to support a charity, and that's where anything that I would have earned from them is going.

Bottom line?

I hope you find this book useful. There is both a paperback and a Kindle edition. The Kindle eBook is far cheaper.

Greg's rating: you decide

Snowflake for SQL Server users – Part 2 – Cloud First Design

In recent years, I've done a lot of work in software houses (Microsoft calls them ISVs or Independent Software Vendors). Many of these software houses have worked out that they won't be able to just keep selling their on-premises applications because their customers are asking for cloud-based solutions.

And more importantly, the customers want the software houses to manage the applications, rather than having to manage the applications themselves. So, many of the software houses start trying to turn their on-premises applications into Software as a Service (SaaS) applications.

And the outcome? It's almost always really, really poor.

Cloud First

The real problem is that those applications haven't been written with a cloud-first mentality. They really are often the "square peg in a round hole". What often happens is that the cloud-based versions of the applications have large numbers of limitations compared to the on-premises versions of the same applications, because recreating everything in the cloud (when it wasn't designed for the cloud in the first place) is often nearly impossible.

I'm a big fan of Azure SQL Database, and they've done a great job on it (way better than almost any other application), but it's still quite a distance short of where we already were with SQL Server on-premises. I wish the marketing for the product would focus on what is there (i.e. a truly amazing product), but the discussion always seems to be around what's missing, and how it compares to the on-premises product. In fact, I'm sure the entire reason that the Managed Instance version of Azure SQL Database appeared was to address some of those shortcomings.

If the first time you'd seen SQL Server was when you saw Azure SQL Database, you'd come away saying how amazing it is. But if you've come from the on-premises product, chances are that you're already using something that isn't there.

Nice to have a blank slate

Even if you were brand new to Azure SQL Database though, you'd find aspects of how the product is designed that are based on thinking from decades ago, and on the systems that were available back then. It's very hard to make major changes when you're bound by the need to keep backwards compatibility.

One key advantage that the team building Snowflake had was a clean slate, where they could design a product that targeted cloud-provider services under the covers, instead of on-premises physical devices and operating systems.

 

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server users – Part 1 – Why Snowflake?

A few months back, I started noticing that many of our clients had started to mention Snowflake.

In recent years, I've been in lots of planning and architectural meetings where there was already a presumption that AWS was being used rather than Azure. I put that down to a great selling job by the AWS people who got corporate IT folk locked into large enterprise agreements early. And so no matter what the technical question is, the answer will be something that runs on AWS.

I still think that Azure offers a far stronger cloud story than AWS but I'm looking at the whole end-to-end cloud-based story, particularly with platform services. I just see a stronger, more secure, and better envisaged story in Azure.

Cloud Transformation

A big part of this issue is that many organizations that I go into say they're making a "cloud transformation", yet what they're actually doing is just moving a whole lot of virtual machines into a cloud hosting center.

That's just re-hosting; it's not making a cloud transformation.

For cloud-hosting providers, it's also a no-win game. When all you're doing is re-hosting VMs, you're mostly competing on price, and that's a race to the bottom that no-one really wins.

Data Warehouses

One bright note on the horizon though is around data warehouses. Almost all these corporates seem to get the idea that they want a cloud data warehouse as a service, not just a bunch of re-hosted virtual machines.

For customers who've already made the decision to use AWS (before we come into the meeting), when they look around for cloud data warehouses, Snowflake is one that's often chosen quickly.

A bonus for me is that it's now available on both AWS and Azure.

I've been spending a lot of time lately digging into Snowflake, and in this series of posts, I'll look at Snowflake from an existing SQL Server user's perspective.

Posts in the series

Part 1 – Why Snowflake? 
Part 2 – Cloud First Design
Part 3 – Core Architecture
Part 4 – T-Shirt Sizing
Part 5 – Editions and Security
Part 6 – Role Based Security
Part 7 – Authentication
Part 8 – Case-Sensitivity 
Part 9 – Stages
Part 10 – File formats
Part 11 – GET, PUT, SnowSQL 
Part 12 – Parallelism when loading files 
Part 13 – Programmable objects
Part 14 – Internal storage and micropartitions
Part 15 – Table types 
Part 16 – Primary and Foreign Keys
Part 17 – Clustering Keys for Tables
Part 18 – Time Travel
Part 19 – Fail Safe
Part 20 – Encryption

And more coming…

 

 

 

Opinion: Do your bulk data manipulation in T-SQL, not in row operations in SSIS

I really love SQL Server Integration Services (SSIS). In fact, I wish Microsoft was giving it far more love than it is. So many of my clients use it, but unfortunately, not everyone uses it well, or for the right things.

One really common mistake is to perform lots of row-by-row logic within the SSIS packages. Let's take loading a fact table in a data warehouse as an example:

To load a fact table in a data warehouse that's based on a star schema, you need to find the dimension (usually surrogate) key for each dimension that's associated with the fact.

The Wrong Way

Awesome image by David von Diemar

Now clearly you could use a Lookup task in SSIS to go and find the first dimension key, based upon the dimension business ID. If it's present, you add it to your data flow. If it's not present, perhaps you infer a new row, and add the new row's dimension key to the data flow.

Then you go and do it all again for the next dimension, and so on, and so on.

There are situations where this might make sense, but most of the time, it doesn't.

A Better Way

As an alternative, you could use SSIS to just orchestrate some T-SQL to do the same job, as sketched below:

  • Stage the fact data in a table that has the required columns, plus a nullable column for every dimension key
  • Perform a single update on that staging table, with left joins to look up all the dimension keys and record them when present
  • Perform a single query for each dimension where you need to infer new keys and update the staging table
  • Push the staged data into the target fact table.
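Here's a rough sketch of that pattern (the table and column names are made up purely for illustration):

    -- Step 2: one set-based update to look up all the dimension keys at once
    UPDATE s
    SET s.CustomerKey = c.CustomerKey,
        s.ProductKey  = p.ProductKey
    FROM Staging.SalesFact AS s
    LEFT JOIN dbo.DimCustomer AS c ON c.CustomerBusinessID = s.CustomerBusinessID
    LEFT JOIN dbo.DimProduct  AS p ON p.ProductBusinessID  = s.ProductBusinessID;

    -- Step 3: infer any missing customers, then fill in their new keys
    -- (repeat this pair for each dimension that can have inferred members)
    INSERT dbo.DimCustomer (CustomerBusinessID, CustomerName)
    SELECT DISTINCT s.CustomerBusinessID, N'Unknown'
    FROM Staging.SalesFact AS s
    WHERE s.CustomerKey IS NULL;

    UPDATE s
    SET s.CustomerKey = c.CustomerKey
    FROM Staging.SalesFact AS s
    INNER JOIN dbo.DimCustomer AS c ON c.CustomerBusinessID = s.CustomerBusinessID
    WHERE s.CustomerKey IS NULL;

    -- Step 4: push the fully keyed rows into the fact table
    INSERT dbo.FactSales (CustomerKey, ProductKey, SalesAmount)
    SELECT s.CustomerKey, s.ProductKey, s.SalesAmount
    FROM Staging.SalesFact AS s;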

The difference in performance is often very, very large.

I know that doing the T-SQL work requires a different skill set to what's required for the SSIS work, but whenever you're dealing with a large amount of data manipulation in SSIS, or now in ADF (Azure Data Factory), consider whether you should just be orchestrating T-SQL instead.

 

Power BI: Creating an IsWeekday function in Power Query M Language

I spend a lot of time doing consulting that involves Power BI. One of the core skills for working with Power BI is getting yourself comfortable with the M language that's used in Power Query.

I was asked the other day, how to create a function that determines if a date is a weekday or a weekend. That's actually quite easy. The Date.DayOfWeek function can be adapted to do that.

In Power Query, go to the Edit Queries option and choose to add a new query. (You could also start by duplicating an existing query.) Then on the View menu, choose the Advanced Editor and replace any code that's there with this:
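The code itself is tiny. A minimal version of the function, based on the logic described below (the parameter name DateToCheck is just my choice), looks like this:

    // Returns true when the supplied date falls on Monday to Friday
    (DateToCheck as date) as logical =>
        Date.DayOfWeek(DateToCheck, Day.Monday) < 5

Once it's saved, calling IsWeekday(#date(2019, 7, 15)) (a Monday) returns true, while IsWeekday(#date(2019, 7, 13)) (a Saturday) returns false.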

Save it and call the query IsWeekday.

Date.DayOfWeek returns the day number in the week, but you need to tell it which day your week starts on. In this case, because I wanted to have Saturday and Sunday as weekend days, I started the week on Monday. That means that Saturday is then day 5 and Sunday is day 6 because the first day is day 0. So then all I need to do is check for a value less than 5.

If you need to change the weekend to say Friday and Saturday, you'd just need to replace the Day.Monday constant in the script with Day.Sunday.
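With that change, the comparison (using the same hypothetical parameter name) becomes:

    Date.DayOfWeek(DateToCheck, Day.Sunday) < 5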

Testing the function

Testing the function is also easy. Once you save the code, you'll see a template for calling the function:

Select a date, click Invoke and see the result:

 

 

Book Review: Pro Power BI Architecture

One of my Kiwi buddies who specializes in Power BI is Reza Rad. I was pleased to see that he now has a set of eBooks on Power BI, but I was especially happy to see that he has a book called Pro Power BI Architecture.

There are lots of books around that discuss how to use Power BI, but there's been a real lack of books on architecting solutions using Power BI. So if you want to learn to develop dashboards or reports, this isn't the book for you. Reza has other books for that.

I enjoyed reading the book and I liked the degree of coverage it gave to these topics.

If you are looking for ways to integrate Power BI into your solutions architecture, this book is a pretty good starting point.

What I was really hoping for though, was more info on administration. Mind you, the book doesn't claim to provide that. I keep getting asked about materials around administration issues. Perhaps that's another edition for Reza to consider. But the architects who need high level overviews of all the key topics should be pretty happy.

Bottom line?

I enjoyed reading this book, and it's a pretty easy read. Great for architects considering Power BI.

Greg's rating: 7 out of 10

Bring your reports and analytics to life. Learn SQL Server spatial now.

I love working with SQL Server spatial. Don't just use boring tables and charts. Spatial data can really help to bring your reports and analytics to life. Don't assume spatial is just about maps. (It's not.) Don't assume you don't have spatial data. (I'm sure you have addresses, locations, etc.)

We've been working hard lately on getting many of our popular in-person courses converted to being available online and on-demand. We're really proud to now have SQL Server Spatial for Developers and DBAs available.

We wanted to give you an experience that's as close as possible to a classroom course, but right in your own location, and at a time that suits you.

It's a detailed course. It's not just a set of videos. There are hands-on labs that you can do (we supply the data), quizzes that you can complete, and once you're even more awesome, you can get a certificate of completion.

It's $195 USD but, best of all, if you use the coupon GOSPATIAL at the checkout, you can enrol for $95 USD (until the end of January). Just click on the enrol button (it says $195), and then click Add Coupon on the next page.

Check it out here: https://training.sqldownunder.com/p/sqlserver-spatial-data

And check out all our courses here: https://training.sqldownunder.com

More courses are on the way very soon.