Snowflake SQL for SQL Server Users – UNDROP

Overall, the SQL language in Snowflake is somewhat of a subset of what's available in SQL Server with T-SQL. But there are also some commands that are really useful and that T-SQL doesn't offer. UNDROP is one of them.

I've talked in earlier posts about how Snowflake stores data in immutable micropartitions, and in other posts mentioned the concept of Time Travel. Well, an added advantage of having older data still accessible is that you can quickly recover from "incidents".

Ever have a sinking feeling when you've run a script that drops an object and suddenly realized you were connected to the wrong database or server?

If you've been in this industry for any length of time, I know you have.

Things go wrong, no matter how careful you are. So it's good to have an option to help you out.

UNDROP TABLE can be used to bring back a table that's previously been dropped. By default, that works for up to 24 hours, but it's also possible to have longer retention periods.
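
As a minimal sketch (the table name here is just an example):

  UNDROP TABLE Customers;    -- hypothetical table name

Note that if a table with the same name already exists, the UNDROP will fail, so you'd rename the existing table first.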

When things really go wrong

But it's not just tables that can be undropped. For the really adventurous, there are also UNDROP SCHEMA and UNDROP DATABASE:
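
A sketch of those (again, the names are just examples):

  UNDROP SCHEMA Sales;
  UNDROP DATABASE SalesDB;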

And these commands are fast. We're talking milliseconds.

There's a psychological aspect to this as well. Having the ability to recover so easily is also useful for increasing the confidence of the people who are making the changes. (Note: I'm not suggesting that should make people more careless. It's just important to know there is a better fallback).

Snowflake SQL for SQL Server Users – Sequences: The good and the not so good

I've been working to convert our SQL Down Under sample databases like PopkornKraze across to Snowflake. One of the areas that I ran into real challenges with was sequences.

The good news

The syntax for creating and accessing them is basic enough.

A simple CREATE SEQUENCE "SequenceName"; is enough.  (Note the double-quotes are mine as I don't want the name auto-wrapped to SEQUENCENAME).

And there are some of the same options that SQL Server has:

  • START WITH = 1            (you can even leave out the equals sign)
  • INCREMENT BY = 1      (again the equals is optional)
  • COMMENT = 'some comment'
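
Putting those options together, a sketch might look like this (the name and comment are just examples):

  CREATE SEQUENCE "SequenceName"
    START WITH = 1
    INCREMENT BY = 1
    COMMENT = 'Key values for the Customers table';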

Snowflake syntax also supports IDENTITY(seed, increment) if you want to define that at a table level. That's like a single-table sequence, but it might help when migrating SQL Server code.
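
For example, a minimal sketch at the table level (the table and column names are hypothetical):

  CREATE TABLE Customers
  (
      CustomerID INT IDENTITY(1,1),    -- seed 1, increment 1
      CustomerName STRING
  );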

What I do really like are the options for replacing or only creating if not exists:
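
A sketch of those two forms (the sequence name is just an example):

  CREATE OR REPLACE SEQUENCE "SequenceName";

  CREATE SEQUENCE IF NOT EXISTS "SequenceName";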

We can't do that in SQL Server today.  (I've complained to the SQL Server team that these types of options don't exist for sequences).

Getting the next value for a sequence is easy enough:
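
A minimal sketch, using the example sequence from above:

  SELECT "SequenceName".NEXTVAL;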

Like SQL Server, there's no guarantee that you won't have gaps in your numbers that are allocated.

There is a table-level function called GETNEXTVAL() that takes the sequence name as a parameter. It's a one row table function that generates a unique value. It's used for precise control of sequence generation when many tables are involved.
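
A sketch of how that might be used, assuming the hypothetical Customers table and sequence from the examples above:

  SELECT c.CustomerName, s.NEXTVAL
  FROM Customers c, TABLE(GETNEXTVAL("SequenceName")) s;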

The not so good news

First up, there's no concept of anything like currval in Oracle (current value), or @@IDENTITY or SCOPE_IDENTITY() in SQL Server. The documentation argues that this is an intentional omission, as those functions encourage row-by-row coding. Can't say I buy that argument.

The biggest challenge (by far) is that you can't reset the next value to be issued by either an IDENTITY or Sequence. This really needs to get fixed. Lots of people are complaining about it in the Snowflake Lodge (user forums) and it just isn't a good thing. Here's an example:

  • You define a table using an IDENTITY column or a Sequence, let's say for a primary key.
  • You need to load existing data into it. This works easily, as there is no concept of something like SET IDENTITY_INSERT ON/OFF. You can just insert.
  • The next row that gets added to the table will now probably be given a duplicate primary key value.
  • And as Snowflake also doesn't check primary keys (or most other constraints), you'll really just end up with multiple rows with the same primary key.

ALTER SEQUENCE does allow you to change the increment, but won't let you change the next value to be issued. (Or the seed).

You have to drop and recreate the sequence. And here's the even-nastier part: dropping the existing sequence isn't blocked, but the column that was using it still references it. So even when you recreate a sequence with the same name, your next insert will blow up with an "object not found" error. (Again, people are complaining about this in the Snowflake Lodge).

This means that you can't add sequences as defaults to tables until you've finished loading all your data. That's just not sensible, and breaks all the rules about separating DDL and data. The DDL that you write for a table should not be dependent upon what data happens to be loaded into a table.

How could you then write DDL that's used in multiple places? (Unless you just always start with a gigantic seed value).

Opinion

This aspect of the product needs to be fixed. It's hard to believe it's so broken. Wish it wasn't so.

 

Snowflake SQL for SQL Server Users – SQL Language Differences

I posted last time that I'd completed my Part 1 eBook on Core Concepts. The Part 2 eBook is on the differences in the SQL language between Snowflake and SQL Server.

This next series of blog posts will focus on the query language, particularly on DML (data manipulation language). DDL (data definition language i.e. CREATE, ALTER, DROP) differences will be covered in the Part 3 series on administration.

Snowflake has quite a rich SQL language, but it's different from the T-SQL offered by SQL Server in many ways. I'm planning to work right through the Snowflake SQL language documentation and try everything in it, so you don't have to.

In this series of posts though, I'll look primarily at what's different, not on what's the same. (Detailed coverage of the language itself will be in our upcoming online course on Snowflake SQL. It's being built now).

This post will serve as an index to the other posts in this series.

You'll find the other posts in the list below:

Part 1: Sequences: The good and the not so good


Snowflake for SQL Server Users – A Christmas Present

I've had a lot of great feedback on my series of blog posts about Snowflake, and explaining the concepts for an existing SQL Server audience.

Just out for Christmas is a new free eBook for our SDU Insiders. I'm preparing a three part series on Snowflake for SQL Server users:

  • Part 1: Core Concepts (available now)
  • Part 2: SQL Language (coming soon)
  • Part 3: Administration (sometime after that :-))

You can get part one now! And did I mention free? (for SDU Insiders)

You'll find it here:

http://snowflakeforsqlserver.sqldownunder.com

Enjoy!

Snowflake for SQL Server Users – Part 20 – Encryption within Snowflake

As with most products today, Snowflake has substantial handling of encryption to protect client data.

All editions claim to provide "Always-on enterprise grade encryption in transit and at rest".

In Transit

First, for connections, HTTPS and TLS 1.2 are used throughout the system.

If the customer is using external stages (holding locations in cloud-storage accounts), the data can be encrypted while stored in the stage, and then travel encrypted into Snowflake. To do that, you create an "encrypted stage" where you tell Snowflake the master encryption key (i.e. the client side key) when creating the stage:
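
As a rough sketch (the stage name, bucket, credentials, and key values here are all placeholders, and the exact options depend on your cloud provider):

  CREATE STAGE encrypted_stage
    URL = 's3://mybucket/encrypted_files/'                              -- placeholder bucket
    CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>')
    ENCRYPTION = (TYPE = 'AWS_CSE' MASTER_KEY = '<base64_master_key>'); -- client-side master key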

This is useful in allowing you to have end-to-end encryption while ingesting data into Snowflake.

At Rest

Regarding "at rest", I've previously described how micropartitions are used to store data in data files. These data files are then encrypted (using AES-256) before being stored.  But how it's handled changes with different editions of Snowflake.

There are four levels of keys used:

  • The root key
  • Account master keys
  • Table master keys
  • File keys

With Standard and Premier editions, a different file key is used each month (aka key rotation). So a single "active" key is used to encrypt your first month's data, then a different key is used to encrypt the next month's data, and so on. Previous keys that are no longer "active" (aka retired) are only used for decryption.

Annual Rekey

Enterprise edition and above offer "Annual rekey of all encrypted data". I was a bit puzzled about how that worked at first. I imagined that once a year, all the existing data would get decrypted and rekeyed. I was thinking that would be quite an expensive operation on a large data warehouse.

What happens instead is that when any key has been retired for a year, any data encrypted with it is decrypted and re-encrypted using a new key. So with those editions, if you store data this month, it will be encrypted with a single key. In a year and one month's time, it will get rekeyed. And again in two years and one month's time, and so on.

Business Critical

The Business Critical edition takes things further. As well as providing  HIPAA and PCI compliance, it offers "Tri-Secret Secure using customer-managed keys". In this case, Snowflake combines the key that you provide with a Snowflake-maintained key to create a composite master key.

As well as having the obvious benefits in letting you manage access to the data more closely, you can immediately remove all access if needed (e.g. when closing an account, or recovering from a breach).

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 19 – Fail-Safe

In the previous post in this series, I talked about how historical data in Snowflake can be accessed by a concept called Time Travel. But Time Travel isn't the only way to get to historical data.

In Snowflake, Fail-safe is a system that provides protection against system failures and other nasty events like hardware failures or serious security breaches.

Immediately after the Time Travel retention period ends, a Fail-safe period begins.  It lasts for a non-configurable period of seven days.

Fail-safe isn't intended for users to access like they do with Time Travel. Instead, Fail-safe data is held so that Snowflake support can recover data that has been lost or damaged.

What about backup?

With Snowflake, there is no concept of a standard backup. The argument is that their multi-data center and highly redundant architecture almost removes the need for backup. And the thinking is that Fail-safe removes that final risk.

The only issue I see with that logic is the same as with most cloud-based databases and data warehouses: users sometimes want to keep point-in-time backups over long periods. I don't currently see any option available for this within Snowflake. But as I mentioned, they aren't alone in that.

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 18 – Time travel

One really important concept in data warehouses is the idea of versioning. Imagine that I have customers broken into a set of business categories. And then I change which customers are in which categories. For so many transactional systems, if I run a sales report for last year, the customers would now appear in the current business categories, rather than the ones they were part of last year.

In Kimball-based designs, that's part of the rationale for what are called Slowly-Changing Dimensions (SCDs).

SQL Server Temporal Tables

SQL Server 2016 introduced temporal tables. Each temporal table has two components: one table that holds the current data, and another table that holds the history for the same data i.e. the previous versions of the rows. While the SQL Server implementation is interesting, in practice, I've found it lacking in many ways.

The biggest challenge is the lack of temporal joins. Even though I can view the contents of a table at a particular point in time, and I can view all the versions of rows from the table over a period of time, there are real challenges when you need to work across joins. There's no way to get rows from one table based upon the timeframe of the rows from another table. In the WideWorldImporters sample databases that I created for SQL Server 2016, you'll find a whole bunch of cursor-based code in the procedures that extract data changes. I really didn't want to write it that way, but I couldn't see any other way to achieve it.

Snowflake and Time Travel

The standard mechanism for accessing historical data in Snowflake is to use what's called Time Travel. The core things that Time Travel lets you do are:

  • Run queries to see the previous version of data at a given time
  • Create a clone of a table based upon a previous version of its data (you can also do this for schemas and databases)
  • Recover previous versions of objects that have been dropped

Because this historical data is available, there is even an UNDROP command that works for tables, schemas, and databases.

When you want to use historical versions of the data, in a SELECT statement (or in a CREATE CLONE statement), you can specify AT for a specific time, or BEFORE to exclude the given time. Then when you specify the time, you have three options:

  • Specify a specific time via the TIMESTAMP option
  • Specify a relative time from the current time by using the OFFSET option with a number of seconds
  • Specify a particular statement via a Query ID

The queries look like this:
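
Roughly like this, anyway (the table name is hypothetical and the query ID is a placeholder):

  SELECT * FROM Customers AT (TIMESTAMP => '2019-12-01 10:00:00'::TIMESTAMP_LTZ);

  SELECT * FROM Customers AT (OFFSET => -60 * 5);       -- five minutes ago

  SELECT * FROM Customers BEFORE (STATEMENT => '<query id>');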

Query ID

This last option is a really curious one for SQL Server people. Snowflake keeps track of previous queries that it has executed. Each one is identified by a Query ID.

There are many things you can access via this Query ID like the query plan involved.

Retention of History

The retention period for Time Travel is based upon the edition of Snowflake that you are using.

  • All editions are able to use a retention period of one day
  • Enterprise Edition allows up to 90 days for permanent objects, but only one day for transient and temporary objects.

Also in Enterprise Edition, you can specify a retention period for individual objects by using the DATA_RETENTION_TIME_IN_DAYS parameter when you are creating the objects.

If you really don't want to use Time Travel (and the space it will use), you can set DATA_RETENTION_TIME_IN_DAYS to 0 at the account level.
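
As a sketch of both of those (the table and its columns are hypothetical):

  CREATE TABLE ImportantFacts (FactID INT, FactDate DATE)
    DATA_RETENTION_TIME_IN_DAYS = 90;

  ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 0;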

For an index to all posts in this series, see the first post here.

Snowflake for SQL Server Users – Part 17 – Data clustering

In SQL Server, most tables benefit from having a clustering key i.e. the column or columns that the table is logically sorted by. (Note: much old training material and documentation used to say "physical order" and that's never been true). To do that, SQL Server stores the data within the pages in a sorted way, and maintains a doubly-linked logical chain of pages. In addition, it maintains pointers to the data by creating an index on it.

By comparison, Snowflake's design tends to encourage you to avoid creating clustering keys on tables. For most tables, the data will automatically be well-distributed across the micropartitions.

You can assess how well that's working by seeing whether there are multiple micropartitions with overlapping values. You can see an image of that in the documentation here:

https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html#clustering-depth-illustrated

The more overlap there is, the greater the clustering depth. You can see that by querying the SYSTEM$CLUSTERING_DEPTH system function. It's also possible to see all the most important clustering details (including the depth) by querying the SYSTEM$CLUSTERING_INFORMATION system function.
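
A sketch of calling those (the table and column names are hypothetical):

  SELECT SYSTEM$CLUSTERING_DEPTH('SalesOrders', '(OrderDate)');

  SELECT SYSTEM$CLUSTERING_INFORMATION('SalesOrders', '(OrderDate)');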

Sometimes you need to cluster

This is all good, but sometimes you need to cluster a table, particularly if a lot of INSERT, UPDATE, and DELETE style operations have been executed against it. (Note: this is only an issue if that has changed the columns that were involved in the auto-clustering and has introduced skew and additional clustering depth).

The aim of the clustering (which, as noted, isn't for all tables) is to co-locate related table data in the same micropartitions, and to minimize clustering depth.

Generally, you'll only do this for very large tables. And sure signs that you need to do it are where your queries have slowed down markedly, and the clustering depth has increased.

Micropartition pruning is eliminating micropartitions that aren't needed for a particular query. It's somewhat like partition elimination and clustered columnstore segment elimination in SQL Server. And it's really important for query performance.

But there's a cost

While it might seem obvious to then keep all the tables clustered like we often do in SQL Server, the more frequently the data in the table changes, the more work is involved in maintaining the clustering.

However, if you have tables that don't change much and are queried all the time, clustering could produce a great outcome.

A few other notes

It's also worth noting that unlike applying a clustering key to a SQL Server index (via a primary key or a clustered index), the change isn't applied immediately. It's done in the background.

You also want to avoid having high cardinality columns as clustering keys. As an example, if you have TIMESTAMP columns (similar to SQL Server datetime or datetime2 columns), you'd be better off adding an expression (like a computed column) that truncated that to a date, and then clustering on that.
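
For example, a sketch of clustering on a truncated expression rather than the raw column (the names are hypothetical):

  ALTER TABLE SalesOrders CLUSTER BY (TO_DATE(OrderTimestamp));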

Reclustering a table in Snowflake is automatic and is performed when it becomes quite skewed.

 

For an index to all posts in this series, see the first post here.

 

Snowflake for SQL Server Users – Part 16 – Primary and Foreign Key Constraints

The Usual Situation

In general database terminology, a primary key is one or more columns that can be used to identify a particular row in a table. The key needs to be unique, and it can't be null. An example would be a CustomerID or CustomerKey in a Customers table.

A foreign key is one or more columns that refer to a key in another table. A common example would be a CustomerID column in an Orders table. Foreign keys can be nullable and are checked when they contain a value.

Foreign Keys in Data Warehouses

I like to see foreign keys in data warehouses. The most common objection to them is performance. And yet when I ask those same people if they've ever tested it, I'm usually told that they haven't but that their brother's friend's cousin read it somewhere on the Internet.

Don't be that person. 

I also hear "the app does that", and so on. Yet, almost every time I check a substantial data warehouse for consistency, when it's run for a while without foreign keys being checked, I invariably find problems. When I show them to people, I then hear "oh yep, we had that bug a while back…" and so on, but there's almost always an issue.

Even if the app checks the data, other apps touching the same data might not. And what if your ETL (or ELT) processes have bugs? You need to find out about it immediately.

Primary and Foreign Keys in Snowflake

I wish it wasn't so, but while you can define primary and foreign keys on tables in Snowflake, the documentation makes it clear that they aren't enforced.

Yep, they are ignored. They are not checked at all.
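
For example, a sketch like this runs without complaint (the table is hypothetical):

  CREATE TABLE Customers
  (
      CustomerID INT PRIMARY KEY,
      CustomerName STRING
  );

  INSERT INTO Customers VALUES (1, 'First customer');
  INSERT INTO Customers VALUES (1, 'Same key again');    -- no error is thrown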

Same deal for unique constraints. (Mind you, unique constraints are really a broken concept in SQL Server as well. While they are checked, SQL Server only allows a single row where the key is null. That's not good either).

And that's why there are people complaining in the Snowflake forums about issues caused by duplicate primary keys.

I really think this aspect of the product needs to be reconsidered and I encourage the Snowflake team to do so. For many sites I work at, this single aspect would be a showstopper.

 

For an index to all posts in this series, see the first post here.


Snowflake for SQL Server Users – Part 15 – Table types

Snowflake has a richer set of options for how tables are created than we have in SQL Server.

CREATE TABLE Variants

As well as a CREATE TABLE statement as you would already be familiar with in SQL Server, Snowflake offers these variants:

CREATE TABLE tablename AS SELECT

This is basically similar to a SELECT INTO in SQL Server. It executes the SELECT query and creates a table from the results (including the data).

CREATE TABLE tablename LIKE

This is an interesting variant. It creates a table with the same schema as an existing table but without any data i.e. it creates a new empty table based upon the design of another table.

CREATE TABLE tablename CLONE

This is another interesting variant. It clones one table to create another table. It's similar to the LIKE option but also includes all the data.
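
A sketch of all three variants (the table names are just examples):

  CREATE TABLE CustomerSummary AS SELECT CustomerID, CustomerName FROM Customers;

  CREATE TABLE CustomersEmpty LIKE Customers;

  CREATE TABLE CustomersCopy CLONE Customers;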

Table Types

Permanent

Permanent tables are standard table types that are pretty much the same as the equivalents in SQL Server.

TEMPORARY

This is similar to a temporary table in SQL Server. The table and its data are retained until the end of the user's session. The syntax supports a concept of LOCAL TEMPORARY and GLOBAL TEMPORARY but these options have no effect. A standard TEMPORARY table is created.

Note that TEMPORARY can be abbreviated to TEMP, and has a synonym of VOLATILE.
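
A sketch (the table and columns are hypothetical):

  CREATE TEMPORARY TABLE SessionScratch (ID INT, Payload STRING);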

TRANSIENT

These tables aren't dropped at the end of a user session and stay until someone drops them. They are also visible to all users.

A key difference with them though is that they don't offer the same level of protection as standard tables. They are more like an eventually-consistent table where you might lose data if the system fails. They should only be used for data that can easily be recreated if needed.
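
A sketch (again, the names are hypothetical):

  CREATE TRANSIENT TABLE StagingOrders (OrderID INT, OrderDate DATE);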

 

For an index to all posts in this series, see the first post here.