I spend quite a bit of each year working with spatial data (mostly mapping data, but sometimes other location-based data). One of the constant challenges is where to find spatial data. Another is where to find services that can help you work with it.
For Australian data, if you haven't considered Mappify.io, I think you should take a look.
You need to create an account before you start, but it's free to create one to try it out. Then there are a number of services available.
You can do simple web-based geocoding if you want. Enter values or upload a CSV and click Geocode.
You get back a map and results that you can export:
And of course there's an API that allows you to do this programmatically.
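As a sketch of what a programmatic call might look like (the endpoint path, field names, and response shape here are my assumptions for illustration only; check the Mappify.io API documentation for the real ones), you can build the request with nothing more than the standard library:

```python
import json
from urllib import request

# NOTE: this endpoint path and these field names are assumptions for
# illustration -- check the Mappify.io API docs for the real ones.
GEOCODE_URL = "https://mappify.io/api/rpc/address/geocode/"

def build_geocode_request(street_address, suburb, state, api_key):
    """Build a JSON request body for a forward-geocode call."""
    return json.dumps({
        "streetAddress": street_address,
        "suburb": suburb,
        "state": state,
        "apiKey": api_key,
    }).encode("utf-8")

def geocode(street_address, suburb, state, api_key):
    """POST the request and return the parsed JSON response."""
    req = request.Request(
        GEOCODE_URL,
        data=build_geocode_request(street_address, suburb, state, api_key),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as response:
        return json.load(response)
```

The geocode call itself obviously needs network access and a valid API key; the build function just shows the general shape of the payload.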
The reverse geocoding option takes a list of latitudes and longitudes and turns them back into addresses.
And again, a map and an exportable set of results:
The Area Code option determines the area that a location is in. You can choose how the area is categorized:
For example, POA returns the postcodes.
I particularly love the option for cleaning addresses. Pass in a full address and have the system break the data into standard columns:
There are many other options as well, like routing and driving directions and distances, spatial analytics, etc. This is a great tool and a great service.
Mappify.io currently says that if you create an account, the first 2,500 requests every day are free. After that, they're 0.1¢ each. That seems like a bargain.
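As a quick sanity check on that pricing (using just the numbers above; the currency and exact billing rules are things to confirm on their site), the daily cost works out like this:

```python
# Cost model for the pricing described above: the first 2,500 requests
# per day are free, and each request after that costs 0.1 cents.
def daily_cost_dollars(requests, free_per_day=2500, cents_each=0.1):
    billable = max(0, requests - free_per_day)
    return billable * cents_each / 100

# e.g. geocoding a 100,000-row file in one day:
# (100000 - 2500) * 0.1c = about $97.50
```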
Learning about Spatial Data
If you're not using spatial data in SQL Server, you should be. And if you need to learn about spatial data in SQL Server, check out our popular on-demand online course:
There are many things I like about Snowflake. How they handle case and collations is not one of them.
There are currently no rich options for handling case like you have in SQL Server, with its detailed options for both collations and case sensitivity.
I've previously written about how I think that case-sensitivity is a pox on computing. I see absolutely no value in case-sensitivity in business applications, and a significant downside.
Case preservation is a different thing. I expect systems to remember the case that I use when I define things, but 99.9% of the time, I want to search for them without caring about case. All that case-sensitivity really provides is the ability to have two objects in the same scope whose names differ only in capitalization. That's usually the result of lazy coding, and almost never a good idea.
Snowflake is basically case-sensitive. I wish it wasn't. There are, however, some workarounds.
To get around some of the problems that case-sensitivity causes, Snowflake automatically upper-cases object names when you define or use them. By default, Snowflake treats the objects Customers, customers, and CUSTOMERS as the same object. In fact, if you execute a statement like:
CREATE TABLE Customers (CustomerID INT);
what it will create is a table called CUSTOMERS. If you execute
SELECT 2 AS Value;
you'll get back the value 2 in a column called VALUE, not the column name that you asked for.
That breaks my basic wish (in any language) for case-preservation. Most objects that you see in almost every Snowflake presentation have names that are all capitalized.
You can, however, get around this by quoting each name with double-quotes.
CREATE TABLE "Customers" (CustomerID INT);
And then you need to do that for every table, every column, every object, etc. from then on. If you execute:
SELECT 2 AS "Value";
you'll get the value 2 with the column name that you're after.
To accommodate the fact that most people won't want to compare strings in a case-sensitive way, they've created some additional operators to deal with case. For example, if you use
WHERE "CustomerName" ILIKE 'Fred%'
you get a case-insensitive version of LIKE.
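If you ever want the same semantics outside the database (say, when checking results in a test harness), LIKE's wildcards map naturally onto a case-insensitive regular expression. A minimal Python sketch:

```python
import re

def ilike(value, pattern):
    """Case-insensitive LIKE: % matches any run of characters,
    _ matches exactly one character, everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # SQL % -> regex .*
        elif ch == "_":
            parts.append(".")       # SQL _ -> regex .
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.IGNORECASE) is not None
```

So `ilike("FREDERICK SMITH", "Fred%")` returns True, just as the ILIKE predicate above would match it.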
Added To My Wish List
The Snowflake people really need to fix how they handle case. You can see from the requests in their user forums that I'm not the only one who thinks so.
This aspect of the product feels very Oracle-like, and also feels like being back in the 1960s. Humans don't like screaming snake case. I really hope they fix it soon, as it's currently one of the weakest aspects of the product.
Like most products, Snowflake comes in a number of editions, and you can see the current editions in the main image above. (Keep in mind that they could change at any time, so check their site for the current options.)
The first thing I need to say is that I really like the way most of the SQL code surface is pretty much identical across editions. I wish that coverage was complete, but it currently doesn't include materialized views.
Note: That was a great change when the SQL Server team did the same back in SQL Server 2016 SP1.
There is no free edition like we have with SQL Server Express. Similarly, there's no free option for developers like we have with SQL Server Developer Edition. That's not surprising though, as they aren't the cloud provider, they are purchasing services from cloud providers. I find that the Standard edition is pretty low cost though: you only pay a fairly low amount for storage, and you only pay for compute when you use it. So that's not too bad.
The main difference between Standard and Premier is that the latter comes with premier support. So that's not a bad way to distinguish, say, development workloads from production workloads. I'd rather see just a single edition, though, with support as an optional extra across all editions.
Snowflake has a feature called Time Travel. This allows you to see what data looked like at earlier times. It's a bit like temporal tables but also quite different to it. I'll talk more about it in another post.
Standard and Premier both have one day of time travel though, and Enterprise edition takes you up to 90 days. I like to see that sort of feature used as a differentiator between editions. It mostly wouldn't require code changes when working with different editions.
Business Critical basically introduces more security. It adds HIPAA and PCI compliance, and the ability to use customer-managed encryption keys. I can't say that I love the idea of core compliance as a distinction between editions. Everyone's data is important to them. Customer-managed keys are a good edition differentiator though.
Snowflake data gets encrypted anyway, with a key that changes each month (for new data). But on lower editions, existing data doesn't get re-keyed. What Business Critical also adds is annual rekeying: data that's a year old gets decrypted and re-encrypted with a new key.
VPS or Virtual Private Snowflake is for people who can't tolerate the idea of any sort of shared Snowflake infrastructure. The Snowflake team do a completely separate deployment of the whole Snowflake stack, just for each customer. It's super expensive (I heard it starts at over $50M AUD) and so I can't imagine too many customers using it, but I'm sure there will be a few, including right here in Australia.
I heard that VPS was only available on AWS at the time of writing, but I'm sure that will change. And I'm guessing if you front up with $50M+, and say you want it on Azure, it's unlikely they'd say no.
I've been in the IT industry a long time. I see trends come and go. (Mostly they go). At this point, I think I'm an OK judge of what's going to fly and what isn't. (Far from perfect but OK).
One thing that always puzzles me though is the way the word Modern is used as a put-down for things that aren't modern, as though it's a synonym for the word Better.
My favorite over the years has been the SQL language. I cannot tell you how many times I've heard about its imminent death, and how we need a modern alternative.
And yet, time after time, even the products that claimed they didn't need a SQL layer seem to end up adding one.
I think this comes from a type of brashness that I see in many new developers where they think that the output of their last day's thinking is somehow vastly superior to all the thinking that's gone on in the industry over decades.
Guess what? It usually isn't.
It was pretty funny watching products like MongoDB appear, claiming not to need all that pesky ACID transaction stuff, and then, after a few disasters caused by the lack of it, adding it into the product as a feature.
Wow, who knew that transactions and consistency actually mattered in most applications?
We aren't all writing blog posts or chat applications.
It was also pretty funny to see NoSQL get redefined as Not Only SQL.
Modern Data Warehouses
I like to keep across most data-related trends and products, even if I don't plan to use them. I want to be able to discuss them sensibly and understand the pros and cons of each.
I spend some time every single day learning about data-related products, applications, and languages.
Over many years, I've seen what works and what doesn't when it comes to data warehouses, and have lost count of how many shiny new things were going to fundamentally change how we work. The put-down is always that you want a modern data warehouse, not one of those old ones that required hard work.
Guess what? Building a good data warehouse is hard work.
As an example, the next time you hear someone tell you that you should just apply your analytics tools directly over your transactional systems, at least stop and ask yourself whether that would really be a good idea.
In other areas of your life, I'm sure you have a built-in detector for snake oil salesmen. It needs to be applied to technology too.
When Hadoop first appeared, I spent quite a bit of time checking it out. Then when the SQL Server team became enamoured with it and added HDInsight (their flavor of it), I spent even more time checking it out. I even did the Big Data certification in the Microsoft Professional Program that I wrote about recently.
And the more I looked into it, and the more I played around with it, the more I kept thinking: is that it?
There were certainly possible use cases for it, but almost every time that I saw it being forced into a solution, it was the wrong tool. So why did the technical people involved want to push it in? Either they just wanted to learn about the shiny new thing to enhance their CVs, or somehow they were bowled over by the hype.
I remember writing blog posts back in 2013 asking if most use of Big Data was really just Big Hype.
Tools aren't better just because they're considered modern. Don't get lost in the hype.
I mentioned in my last post in this series that the Compute layer of Snowflake is basically made up of a series of Virtual Warehouses (VWs). Each VW is an MPP (massively parallel processing) compute cluster that can comprise one or more compute nodes.
The number of nodes in the compute cluster is called its "size" and the sizing options are made to resemble T-Shirt sizing, as you can see in the main image above.
Note that XS (extra small) is the smallest VW size. Using an XS for one hour consumes one Snowflake credit. In Australia, in Azure, right now, that's a little over $2 USD.
The number of credits used per hour for a VW is directly related to the number of compute nodes that it contains. So a 2XL VW consumes 32 credits per hour.
The other important aspect of this sizing is the number of threads. Currently, there are 8 threads per compute node.
So the number of concurrent threads for a VW goes from 8 at the XS level, up to 1024 at the 4XL level.
I'll talk more about threads later but they become important when you're trying to get good parallel loading of files happening, and when you want significant concurrency in query execution.
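Putting those numbers together (node counts double with each T-shirt size, one credit per node per hour, and 8 threads per node), the arithmetic looks like this:

```python
# Virtual warehouse sizing arithmetic, using the figures in the text:
# nodes double with each size, one credit per node-hour, 8 threads/node.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]

def nodes_for(size):
    return 2 ** SIZES.index(size)

def credits_per_hour(size):
    # One credit per compute node per hour
    return nodes_for(size)

def threads_for(size, threads_per_node=8):
    return nodes_for(size) * threads_per_node

# XS:  1 node,   1 credit/hour,   8 threads
# 2XL: 32 nodes, 32 credits/hour, 256 threads
# 4XL: 128 nodes, 128 credits/hour, 1024 threads
```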
When I was doing some Snowflake training recently, one of the students in the class asked what utility they should use on Windows for splitting a large file into sections. They wanted to split files for better bulk loading performance, to be able to use all available threads.
On Linux systems, the split command works fine, but the best that most people came up with on Windows was to use PowerShell. That's a fine answer for some people, but not for everyone.
Because the answers were limited, and people struggled to find a CSV splitter, I decided to fix that by creating a simple utility that's targeted at exactly this use case.
SDU_SplitFile is a brand new command line utility that you can use to split text files (including delimited files).
The required parameters are positional and are as follows:

- <InputFilePath> is the full path to the input file, including the file extension
- <MaximumLinesPerFile> is the maximum number of lines in each output file
- <HeaderLinesToRepeat> is the number of header lines to repeat in each file (0 for none, 1 to 10 allowed)
- <OutputFolder> is the output folder for the files (it needs to already exist)
- <Overwrite> is Y or N to indicate whether existing output files should be overwritten
There is one additional optional parameter:
- <OutputFilenameBase> is the beginning of the output file name – default is the same as the input file name
Let's take a look at an example. I have a file called Cinemas.csv in the C:\Temp folder. It contains some details of just over 2000 cinemas:
I'll then execute SDU_SplitFile against that file. The arguments say to split Cinemas.csv with a maximum of 200 rows per output file.
As you can see in the previous image, the CSV has a single header row. I've chosen to copy that into each output file. That way, we're not just splitting the data rows, we can have a header in each output file.
We've then provided the output folder, and also said Y for overwriting the output files if they already exist.
And in the blink of an eye, we have the required output files, and as a bonus, they're all already UTF-8 encoded:
It's easy to get the tool and start using it. You can download a zip file containing it here.
Just download and unzip it. As long as you have .NET Framework 2.0 or later (that's pretty much every Windows system), you should have all the required prerequisites.
I hope you find it useful.
It's 100% free for you to download and use. I think it's pretty good but as with most free things, I can't guarantee that. You make your own decisions if it's suitable for you.
The first thing to understand about Snowflake is that it has a very layered approach. And the layers are quite independent, including how they scale.
Cloud Provider Services
The lowest level isn't part of Snowflake; it's the services that are provided by the underlying cloud provider. As a cloud-native application, Snowflake is designed to use services from the cloud provider it's deployed to, rather than providing all of those services itself. At present, that means AWS or Microsoft Azure; deployment on Google Cloud Platform is in preview at the time of writing.
Each deployment of Snowflake includes the upper three layers that I've shown in the main image above.
This layer is exactly what it says. It uses storage from the cloud provider to store anything that needs to be persisted. That includes the obvious things like databases, schemas, tables, etc., but it also includes less obvious things like caches for the results of queries that have been executed.
Storage is priced separately from compute. While the prices differ across cloud providers and locations, the Snowflake people have said they aren't aiming to make much of a margin on the storage costs. They're pretty much passing on the cloud provider's cost.
In addition, when you're staging data (which I'll discuss later), you can choose to use Snowflake-managed storage or your own storage accounts with a cloud provider to hold that staged data.
This is the heart of where Snowflake make their income. The compute layer is made up of a series of virtual clusters that provide the compute power to work on the data. Importantly, each of the virtual clusters (called Virtual Warehouses) can independently access the shared underlying storage.
Compute is charged by consuming what are called "credits". I'll write more about those soon. What is interesting is that you only pay for compute while a virtual warehouse is running.
While there are some blocking commands, the idea is that you should do most of those in staging areas, to keep your shared storage accessible in a highly concurrent way. The aim is to have a virtual warehouse happily querying the data, while another virtual warehouse is in the middle of bulk loading other data. And the virtual warehouses can each be different sizes.
Global Services Layer
This is the layer where all the metadata and control lives. It's also where transactions are managed, and where security and data sharing details live. I'll describe each of its functions in future posts.
Over the last few months, one of my Kiwi buddies (and fellow member of both the MVP and Microsoft Regional Director programs) Reza Rad has been organizing a bunch of us to write a book that's a collection of ideas from a number of MVPs. It's the Power BI MVP Book.
There are a whole lot of authors from a whole lot of different countries: Reza Rad, Anil Maharjan, Indira Bandari, Liam Bastick, Ken Puls, Jesus Gil, Thomas LeBlanc, Ike Ellis, Matt Allington, Leila Etaati, Markus Ehrenmüller, Ashraf Ghonaim, Eduardo Castro, Manohar Punna, Treb Gatte, Gilbert Quevauvilliers, Michael Johnson, Shree Khanal, Asgeir Gunnarsson, Greg Low, Gogula Aryalingam.
I've done these types of books before with the pair of SQL Server MVP Deep Dives books. They're a different type of book in that you're not getting a single story throughout; instead, you're getting a whole set of independent chapters on a variety of topics related to Power BI.
The general idea of these books is to support a charity, and that's where anything that I would have earned from them is going.
I hope you find this book useful. There is both a paperback and a Kindle edition. The Kindle eBook is far cheaper.
Greg's rating: you decide
Note: as an Amazon Associate I earn (a pittance) from qualifying purchases but whether or not I recommend a book is unrelated to this. One day it might just help cover some of my site costs. (But given the rate, that's not really likely anyway).
In recent years, I've done a lot of work in software houses (Microsoft calls them ISVs or Independent Software Vendors). Many of these software houses have worked out that they won't be able to just keep selling their on-premises applications because their customers are asking for cloud-based solutions.
And more importantly, the customers want the software houses to manage the applications rather than themselves. So, many of the software houses start trying to turn their on-premises applications into Software as a Service (SaaS) applications.
And the outcome? It's almost always really, really poor.
The real problem is that those applications haven't been written with a cloud-first mentality. They really are often the "square peg in a round hole". What often happens is that the cloud-based versions of the applications have large numbers of limitations compared to the on-premises versions of the same applications, because recreating everything in the cloud (when there were things that weren't designed for the cloud) is often nearly impossible.
I'm a big fan of Azure SQL Database, and they've done a great job on it (way better than almost any other application), but it's still quite a distance short of where we already were with SQL Server on-premises. I wish the marketing for the product would focus on what is there (i.e. a truly amazing product), but the discussion always seems to be about what's missing and how it compares to the on-premises product. In fact, I'm sure the entire reason that the Managed Instance version of Azure SQL Database appeared was to address some of those shortcomings.
If the first time you saw SQL Server was in the form of Azure SQL Database, you'd come away saying how amazing it is. But if you've come from the on-premises product, chances are you're already using something that isn't there.
Nice to have a blank slate
Even if you were brand new to Azure SQL Database though, you'd find aspects of how the product is designed that are based on thinking from decades ago, and were designed for systems that were available back then. It's very hard to make major changes when you're bound to trying to keep backwards compatibility.
One key advantage that the team building Snowflake had was a clean slate where they could design a product that targeted cloud provider based services under the covers, instead of on-premises physical devices and operating systems.
A few months back, I started noticing that many of our clients had started to mention Snowflake.
In recent years, I've been in lots of planning and architectural meetings where there was already a presumption that AWS was being used rather than Azure. I put that down to a great selling job by the AWS people who got corporate IT folk locked into large enterprise agreements early. And so no matter what the technical question is, the answer will be something that runs on AWS.
I still think that Azure offers a far stronger cloud story than AWS but I'm looking at the whole end-to-end cloud-based story, particularly with platform services. I just see a stronger, more secure, and better envisaged story in Azure.
A big part of this issue is that many organizations that I go into say they're making a "cloud transformation", yet what they're actually doing is just moving a whole lot of virtual machines into a cloud hosting center.
That's just re-hosting; it's not making a cloud transformation.
For cloud-hosting providers, it's also a no-win game. When all you're doing is re-hosting VMs, you're mostly competing on price, and that's a race to the bottom that no-one really wins.
One bright note on the horizon though is around data warehouses. Almost all these corporates seem to get the idea that they want a cloud data warehouse as a service, not just a bunch of re-hosted virtual machines.
For customers who've already made the decision to use AWS (before we come into the meeting), when they look around for cloud data warehouses, Snowflake is one that's often chosen quickly.
A bonus for me, is that it's now available on both AWS and on Azure.
I've been spending a lot of time lately digging into Snowflake, and in this series of posts, I'll look at Snowflake from an existing SQL Server user's perspective.