SQL: Understanding Change Data Capture for Azure SQL Database – Part 2 – How does it work?

In part 1 of this series, I discussed the positioning of Change Data Capture. In part 2, I want to cover how it works.

Log Reading

There are many ways that you can output details of changes that occur in data within SQL Server. Many of those methods require actions to occur at the time the data change is made. This can be problematic.

The first problem is the performance impact on the application that's making the change. If I update a row in a table, and part of the process writes details of that change to some type of audit or tracking log, I've increased the work that needs to happen in the context of the application making the change. Generally, this means I've slowed the application down by at least doubling the work that needs to be performed. That might not be well-received.

The second potential problem is even nastier. What if the change tracking part of the work fails even though the original update works? If I've done the change tracking work in the context of the original update (particularly if it's done as part of an atomic process), by adding tracking I might have broken the original application. That certainly wouldn't be well-received.

So what to do?

The best answer seems to be to work with the transaction log that's already built into SQL Server. By default, it already holds details of the changes that have been occurring to the data. It can be read asynchronously, so delays in reading it mostly won't affect the original data changes at all (there are only rare exceptions to this). If reading the log fails, the problem can be corrected and the reading restarted, again without affecting the main updates that are occurring.

And that's what Change Data Capture does. It uses the same log reader agent that has been part of SQL Server for a very long time, although previously it was used for Transactional Replication. In fact, if you use both Transactional Replication and Change Data Capture on the same SQL Server system, they share the same instance of the log reader. The SQL Server Agent is used to make them both work.

SQL Server Agent – Isn't that missing?

When we're working with Azure SQL Database, things are a bit different. Currently, we don't have any concept of Transactional Replication. That could change but right now, it's not there. So sharing the log reader isn't an issue.

But I also mentioned that with SQL Server, it was the SQL Server Agent that kicked off the log reading agent. And with Azure SQL Database, we don't have SQL Server Agent either!

The Azure SQL Database team have instead provided a scheduler that runs the log reader (the capture process), and also runs the required clean-up tasks (work that SQL Server performed with a separate clean-up job). This is all automated and requires no maintenance from the user.

Change Data Capture (CDC) Data Flow

The data flow with CDC is basically like the following:

CDC Data Flow

  1. The original app sends a change (insert, update, delete) to a table in the Azure SQL Database.
  2. The change is recorded in the transaction log.
  3. Some time later (usually not long though), the change is read by the capture process and stored in a change table.
  4. The Data Warehouse (DW) or target system or ETL system makes a call to a set of CDC functions to retrieve the changes (see the sketch below).

Everything in the dotted box above is part of, and contained within, the Azure SQL Database.
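As a rough sketch of step 4, assuming a table dbo.Customers that has been enabled for CDC with the default capture instance name dbo_Customers, the target system can retrieve the changes like this:

    -- dbo_Customers is an assumed capture instance name for this sketch
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customers');
    DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();

    -- Return every change (insert, update, delete) recorded between the two LSNs
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Customers(@from_lsn, @to_lsn, N'all');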

Upcoming

In the next post, I'll show you the code that's required, and the changes that occur in the Azure SQL Database when you enable CDC.

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database

 

SQL: Understanding Change Data Capture for Azure SQL Database – Part 1 – Why?

I often need to capture the changes from one database into another. The most common reason is that I want to bring changes from a transactional system across into a data warehouse that's part of a BI setup.

So which technology is best to use for this?

That's not a trivial question to answer but here are some thoughts:

Replication?

Unfortunately, this one's not fully available for Azure SQL DB as yet. Azure SQL DB can be a subscriber in Transactional Replication, and we often use it that way. If we have an on-premises SQL Server, one of our favourite ways to get data into the cloud is by using Transactional Replication. (If you need to get your head around Replication with SQL Server, just head to our course here.)

There are many advantages to replication, including the lack of impact on the source system. However, Azure SQL DB can't currently be a publisher, so it doesn't help here.

Other forms of replication aren't really useful here, or aren't available options. So if the source DB is an Azure SQL DB, we need to find something else.

Azure SQL Data Sync

Azure SQL Data Sync is an odd technology. It basically grew out of Merge Replication ideas. It's not built on Merge Replication, but it's very similar in concept. It stayed in a preview state for so long, and the team had stopped posting information about it for so long, that most of us never thought it would ever reach GA.

You create a setup similar to this:

The sync metadata lives in a hub DB in Azure, and a copy of the DB that you want to sync is created as an Azure SQL DB. The Azure Data Sync engine then synchronizes the data between the hub and the other DBs. If any of the DBs are on-premises, then an on-premises agent does the work.

Azure Data Sync (like Merge Replication) is trigger-based. Triggers are used to capture the changes ready for synchronization.

I wasn't a fan of Merge, and I can't say I'm a great fan of Azure SQL Data Sync. While it's conceptually simple, you would not want to use it for anything except very low volume applications.

Change Tracking

Change Tracking is another technology that's come directly from SQL Server land. When it's enabled, a set of change tracking tables are created. As data is changed in the tables of interest, changes are recorded in the change tracking tables.

One positive aspect of Change Tracking is that it isn't based on triggers and it outperforms trigger-based solutions. There are two downsides:

  • The changes are written synchronously, in the context of the transaction that writes the change to the tracked table. This can impact the performance of changes to the tracked table, i.e. usually two writes are happening for each one that would have happened before.
  • You don't get to see all the changes, nor do you see them in the order that they happened. Change Tracking lets you know which rows have changed, based upon the table's primary key. (You can also ask for a summary of which columns were changed.) This can be a challenge when dealing with referential integrity, and other issues.
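For contrast, here's a minimal sketch of Change Tracking in use, assuming a hypothetical dbo.Customers table with a CustomerID primary key:

    -- Enable change tracking for the database, then for the table of interest
    ALTER DATABASE CURRENT
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Customers ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = ON);

    -- Ask which rows have changed since version 0: you get keys and operations,
    -- not the full history of values
    SELECT ct.CustomerID, ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_VERSION
    FROM CHANGETABLE(CHANGES dbo.Customers, 0) AS ct;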

Queues (and Service Broker)

Another interesting option is to write to a queue. With an on-premises SQL Server, we can use Service Broker. If you haven't seen Service Broker, it's a transacted queue that lives inside the database. (To learn about this, look here).

With SQL CLR code or with External Activation for Service Broker, we could write to other types of queue like RabbitMQ.

At the current time, Azure SQL Database doesn't support writing to external queues. However, I do expect to see this change, as so many people have voted to have this capability added.

Change Data Capture

Change Data Capture (CDC) is another technology direct from SQL Server land. CDC is based on reading changes from a database's transaction log.

When you use it with SQL Server, it shares the same transaction log reader that Transactional Replication (TR) does. If you enable either CDC or TR, a log reader is started. If you have both enabled, they use a single log reader.

A key upside of using a log reader is that it doesn't slow down the initial updates to the target table. The changes are read asynchronously, separately.

Until recently, though, you could not use CDC with Azure SQL Database. The log reader agent ran from within SQL Server Agent, and with Azure SQL Database, you didn't have a SQL Server Agent.

The product team have recently done the work to make CDC work with Azure SQL Database. It is an interesting option for extracting changes from a database, so this is the first blog post in a series of posts about using CDC with Azure SQL Database. Links to other posts will be added here as they are available:

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database

 

 

Cosmos Down Under podcast 6 with guest Vincent-Philippe Lauzon is now published!

I published a new Cosmos Down Under podcast today.

In the podcast, I discussed the new integration between Azure Cosmos DB and Azure Data Explorer with the program manager for the feature: Vincent-Philippe Lauzon.

It's a great new capability and worth half an hour of your time to find out about it.

I hope you enjoy it. You'll find it, along with all Cosmos Down Under podcasts at: https://podcast.cosmosdownunder.com

SDU Podcast #83 with guest Davide Mauri released

I recently recorded the first of a series of podcasts for the SQL Server 2022 timeframe. Once again, it was great to chat with another old friend, Davide Mauri. Davide is a Principal Product Manager for Azure SQL Database.

The topic I chose for today was the new sp_invoke_external_rest_endpoint system stored procedure. It's currently only available for Azure SQL Database, but I expect it will get ported to other flavours of SQL Server over time. Davide has been deeply involved with the development of this procedure.

Developers have wanted a better way to interact with external services from within Azure SQL Database and this procedure really helps with that. I feel it lets the database now integrate much more closely into the Azure ecosystem.
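As a minimal sketch (the endpoint URL here is made up), calling a REST service and capturing its response looks like this:

    DECLARE @response nvarchar(max);

    -- hypothetical endpoint, for illustration only
    EXEC sp_invoke_external_rest_endpoint
        @url = N'https://example.azurewebsites.net/api/hello',
        @method = 'GET',
        @response = @response OUTPUT;

    SELECT @response AS Response;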

You'll find the podcast with Davide (and all our other podcasts) here: https://podcast.sqldownunder.com

Azure SQL Database now has an improved STRING_SPLIT !

I get pretty excited about new T-SQL enhancements. Back in 2016, I was so pleased to see Microsoft finally add a string split option to T-SQL, but my enthusiasm was limited when I saw how it was implemented. Now that's mostly fixed!

While it was possible to build functions that did string splitting just the way I wanted, the problem was that no matter how you implemented them, they were really just too slow and, most importantly, much slower than a built-in function would be.

The Issues

SQL Server 2016 allowed us to write code like this (shown here with a sample string of my own):
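    SELECT value
    FROM STRING_SPLIT(N'red,orange,yellow,green,blue', N',');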

That code would return this output:
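    value
    ------
    red
    orange
    yellow
    green
    blue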

STRING_SPLIT is a table-valued function. Back then, it took two parameters:

  • The string to be split (Unicode or not)
  • The separator

It returned a set of rows with a single column called value, and it was fast.

So what were the problems?

  • The separator could only be a single character
  • Often you need to trim the values being returned. (That can obviously be done outside this function using TRIM, but it'd be nice if there was an option to do it in the function.)
  • But the biggest issue of all was that, because a table-valued function doesn't have an inbuilt order for returning rows, you didn't know which extracted token was which.

Now a TVF returns a table, and tables don't have built-in orders either. Nor do views. But what was needed was another column that told you which row was which; you could then use that for ordering when needed.

Azure SQL Database

STRING_SPLIT in Azure SQL Database now has an optional parameter to enable an ordinal (i.e. a position).

That extra parameter is a bit or an int that says whether an ordinal column should be added to the output.

So code like this (again with my sample string):
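    SELECT value, ordinal
    FROM STRING_SPLIT(N'red,orange,yellow,green,blue', N',', 1);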

Now returns:
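    value    ordinal
    ------   -------
    red      1
    orange   2
    yellow   3
    green    4
    blue     5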

I love working with Azure SQL Database for many reasons, but one of those is that you get new language extensions before other parts of SQL Server. I've been using this feature since the day that I first heard it existed, and it's working great. Now the updated documentation is online as well.

It also means that I can obtain any specific token that I want. For example, to get the third token in the string:
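    SELECT value
    FROM STRING_SPLIT(N'red,orange,yellow,green,blue', N',', 1)
    WHERE ordinal = 3;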

And that returns just the single value for the third token (yellow, in my sample).

This is a much-needed enhancement.

ADF: Use MSIs not SQL Logins whenever possible

Azure Data Factory (ADF) is great for moving data around. It often needs to connect to SQL databases. If you're creating linked services for this, please try to avoid using SQL logins. Don't use usernames and passwords for authentication if you can avoid them.
 
Azure SQL Database lets you create a user from a data factory's managed service identity (MSI). MSIs are a special type of Service Principal (SP). They provide an identity where you don't need to manage the ID or any sort of password. Azure does that for you.
 
Each ADF exposes an Azure-based identity on its property page. Here is the main menu for my BeanPerfection data factory:
Data factory properties page
You might have guessed you should go to the Managed identities menu option, but that is for User Managed identities. I'll write more about them another day. The identity we're interested in is a System Managed Identity.
On this properties page, it's actually called the Managed Identity Application ID. Like all other types of Service Principal, Managed Service Identities have an ApplicationID.

Creating a user in Azure SQL Database

I create a user in an Azure SQL Database with the FROM EXTERNAL PROVIDER option. This says that Azure SQL isn't performing the authentication.

The user name is the full name of the ADF. In Azure SQL Database, we can just use the name of the ADF instead of the underlying identity; it resolves that for us.
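Using the data factory name and role name from this post, the code looks something like this:

    CREATE USER [beanperfectiondatafactory] FROM EXTERNAL PROVIDER;

    CREATE ROLE datafactory_access;
    ALTER ROLE datafactory_access ADD MEMBER [beanperfectiondatafactory];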

I always make this data factory name lower-case. Once the user is created, I add it to whatever role in the database makes sense. In my example above, I used the role name datafactory_access but there's nothing special about that name.

You need to decide which role to add it to based upon the tasks that the data factory needs to perform. While it's temptingly easy to just add it to db_owner, try to resist that. If in doubt, create a role that's used for data factory access and grant it permissions as they are needed.

Must execute using an AAD-Based Connection

If you just try to execute the statements above, you might find that you get an error message saying:

Msg 33159, Level 16, State 1, Line 8
Principal 'beanperfectiondatafactory' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.

You cannot create any type of Azure AD-based user in an Azure SQL Database if your connection was authenticated as a SQL login. You must use a connection that was itself made using Azure AD authentication.

I find that the easiest way to do that is to make sure I have an Azure Active Directory admin assigned for my Azure SQL Server, and then just execute the code right in the Azure Portal. I use the Query editor tab in the main menu for the Azure SQL Database and connect as that administrator.

 

SQL: Fix: The parameters supplied for the procedure "sp_set_firewall_rule" are not valid.

I often use the procedure sp_set_firewall_rule to set server-level firewall rules for an Azure SQL Server. (There's a similar call to set firewall rules for databases.) The other day though, I got an error that had me puzzled:
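It was a call along these lines (the IP range shown is just a placeholder):

    EXEC sp_set_firewall_rule 'TestRule', '10.0.0.1', '10.0.0.10';

And the error returned was:

    The parameters supplied for the procedure "sp_set_firewall_rule" are not valid.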

I also tried it with named parameters and got the same error.

When I looked at my previous scripts, I realised that I had always used a Unicode string for the first parameter.

Solution

I changed 'TestRule' to N'TestRule' and it worked fine.
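That is, the working call looked like this (again with a placeholder IP range):

    EXEC sp_set_firewall_rule N'TestRule', '10.0.0.1', '10.0.0.10';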

I've not seen a procedure before that wants a Unicode string and won't accept an ASCII string. For example, a call like this works just fine, even though sp_help declares its @objname parameter as nvarchar:
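    -- sp_help's @objname parameter is nvarchar, yet an ASCII string is fine
    EXEC sp_help 'sysobjects';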

Apparently strict data type checking has always been a feature of extended stored procedures and this one checks specifically for a Unicode string. I really don't use extended stored procedures much anyway.

What threw me as well is that I couldn't find it in the list of system stored procedures. It's not there because it's actually an extended stored procedure. These used to mostly have xp prefixes, not sp prefixes, and are a good example of why I don't love using prefixes like that.

I really wish, though, that this procedure had a better error message. While the current one is strictly correct, it's not actually all that helpful.

 

SQL: Password complexity rules for Azure SQL

Azure SQL (both Azure SQL Database and Azure SQL Managed Instance) has different password complexity rules from SQL Server. I was reading an email discussion list where a poster asked where he could find the list of password complexity rules for Azure SQL. I said I'd never seen such a list.

Well it turns out that there is a list, but not where you might have thought to look. They're spelled out in this article:

Identify the right Azure SQL Database SKU for your on-premises database (Data Migration Assistant) – SQL Server | Microsoft Docs

To save you reading the whole article: at the time of writing, these were the rules for Azure SQL Database:

  • Your password must be at least 8 characters in length and no more than 128 characters in length.
  • Your password must contain characters from three of the following categories – English uppercase letters, English lowercase letters, numbers (0-9), and non-alphanumeric characters (!, $, #, %, etc.).
  • Your password cannot contain all or part of the login name. (Part of a login name is defined as three or more consecutive alphanumeric characters.)

Slightly more confusing is that the article says it's talking about the server admin password, but the same list of restrictions appears to apply to all login/user passwords as well.

Azure SQL Managed Instance is exactly the same except it requires the password to be at least 16 characters in length.

I hope that helps someone, including myself when I next go looking for this.

 

SQL: UNMASK is now granular in Azure SQL

Dynamic Data Masking was added to SQL Server back in 2016. I've almost never used it since. Why? There were two reasons:

#1: The first reason was that it didn't work in conjunction with other security features. In particular, you couldn't use dynamic data masking in conjunction with Always Encrypted. From the start, I've been telling the product group that this is a misstep. There's very little data that I'd want to hide from my own staff yet be happy to have visible to a hosting provider.

This is still an issue today. Some users might just want to use Dynamic Data Masking though, without Always Encrypted, so there might be a use case there, particularly with on-premises systems.

#2: (And this was the big one) UNMASK wasn't granular enough. There was a single UNMASK permission for the whole database. That meant it worked nothing like other permissions in the database. We can usually set permissions on schemas (my favourite), tables, and columns, and I didn't see UNMASK as any different. Again, a number of us made that very clear to the product team.

However, this has now changed!

I wanted to call attention to a recent blog post that mentioned in passing that UNMASK is now granular. You can still apply it as before, to the whole DB, but you can now also apply it at the schema, table, and/or column level.
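For example (the user and object names here are hypothetical):

    -- Whole database, as before
    GRANT UNMASK TO ReportingUser;

    -- Schema level
    GRANT UNMASK ON SCHEMA::Sales TO ReportingUser;

    -- Table level
    GRANT UNMASK ON Sales.Customers TO ReportingUser;

    -- Column level
    GRANT UNMASK ON Sales.Customers(EmailAddress) TO ReportingUser;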

Well at least you can in Azure SQL Database (and Synapse Analytics). One of the things I love about working with Azure SQL Database is being at the front of the queue when T-SQL and DB enhancements are made.

This is great news. Thank you to the product team!

Azure: Changing Azure SQL Database Service Objective from T-SQL Commands

When I'm building Azure Data Factory pipelines, I often want to rescale, i.e. change the Service Level Objective (SLO) of Azure SQL Databases before and after processing. For example, I might have a database that sits at S0 or S1 all day long when not being processed (to allow for the odd ad hoc query), but I want it at S6 for overnight ingestion of data and loading of analytic data models. Then I want it to go back to what it was before.

This can make a huge difference to the cost of these databases when they are used intensely but intermittently.

While it's possible to do this from PowerShell and other tools, I've found that the easiest way (by far) is to just execute the required commands from T-SQL.

Stored Procedure Approach?

What I really wanted to do was to put a stored procedure in each database that could be used to scale that same database. I could then have included that procedure in our SDU Tools.

But you can't do that. While the scaling would work, the stored procedure would always return an error because it would be terminated when the rescaling occurs.

The way to avoid being terminated is to execute the commands against the master database instead. The annoying part is that with Azure SQL Database, you aren't allowed to put any user objects in the master database, even though it would otherwise be the perfect location for a stored procedure that changes the SLO of a database.

Just a SQL Batch

In the end, the easiest option is to just send the required SQL batch to the master database. In ADF, I can do that with a single Lookup activity. (That's the way you execute a SQL statement from ADF.)

It's annoying that the Lookup activity only supports parameters when you're calling a stored procedure. It doesn't support parameters for SQL query batches. So I have to just put the values right into the code.

Regardless, if you want to do this, below is an example of T-SQL code that will do that. You'll need to execute it against the master database, and as a server administrator.

(Note: code might wrap)
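Something like the following (a sketch, assuming a database named SalesDB and a target SLO of S6):

    -- Only rescale if the database isn't already at the target SLO
    IF NOT EXISTS
    (
        SELECT 1
        FROM sys.database_service_objectives AS dso
        INNER JOIN sys.databases AS d
        ON d.database_id = dso.database_id
        WHERE d.[name] = N'SalesDB'
        AND dso.service_objective = N'S6'
    )
    BEGIN
        ALTER DATABASE [SalesDB] MODIFY (SERVICE_OBJECTIVE = 'S6');
    END;

    SELECT 1 AS Result; -- the ADF Lookup activity requires a result set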