SQL: Understanding Change Data Capture for Azure SQL Database - Part 1 - Why?

I often need to capture the changes from one database into another. The most common reason is that I’m wanting to bring changes from a transactional system across into a data warehouse that’s part of a BI setup.

So which technology is best to use for this?

That’s not a trivial question to answer but here are some thoughts:

Replication?

Unfortunately, this one’s not available for Azure SQL DB as yet. Azure SQL DB can be a subscriber in Transactional Replication. We often use it this way. If we have an on-premises SQL Server, one of our favourite ways to get data into the cloud is by using Transactional Replication. (If you need to get your head around Replication with SQL Server, just head to our course here).

There are many advantages to replication, including the lack of impact on the source system, however Azure SQL DB can’t currently be a publisher, so it doesn’t help here.

And other forms of replication aren’t really useful here, or an available option. So if the source DB is an Azure SQL DB, we need to find something else.

Azure SQL Data Sync

Azure SQL Data Sync is an odd technology. It basically grew out of Merge Replication based ideas. It’s not built on Merge Replication, but it’s very similar in concept. It was in a preview state so long, and the team had so long since stopped posting information about it, that most of us never thought it would ever reach GA.

You create a setup similar to this:

The sync metadata lives in a DB in Azure, and a copy of the DB that you want to sync is created as an Azure SQL DB. The Azure Data Sync engine then synchronizes the data between the HUB and the other DBs. If any of the DBs are on-premises, then an on-premises agent does the work.

Azure Data Sync (like Merge Replication) is trigger-based. Triggers are used to capture the changes ready for synchronization.

I wasn’t a fan of Merge, and I can’t say I’m a great fan of Azure SQL Data Sync. While it’s conceptually simple, you would not want to use it for anything except very low volume applications.

Change Tracking

Change Tracking is another technology that’s come directly from SQL Server land. When it’s enabled, a set of change tracking tables are created. As data is changed in the tables of interest, changes are recorded in the change tracking tables.

One positive aspect of Change Tracking is that it isn’t based on triggers and it outperforms trigger-based solutions. There are two downsides:

The changes are written synchronously, and in the context of the transaction that writes the change to the tracked table. This can impact the performance of the changes to the tracked table i.e. usually two writes are happening for each one that would have happened.
You don’t get to see all the changes, and not in the order that they happened. Change Tracking lets you know which rows have changed, based upon the table’s primary key. You can also ask to have a summary of which columns were changed). This can be a challenge for dealing with referential integrity, and other issues.

Queues (and Service Broker)

Another interesting option is to write to a queue. With an on-premises SQL Server, we can use Service Broker. If you haven’t seen Service Broker, it’s a transacted queue that lives inside the database. (To learn about this, look here).

With SQL CLR code or with External Activation for Service Broker, we could write to other types of queue like RabbitMQ.

At the current time, Azure SQL Database doesn’t currently support writing to external queues. However, I do expect to see this change, as so many people have voted to have this capability added.

Change Data Capture

Change Data Capture (CDC) is another technology direct from SQL Server land. CDC is based on reading changes from a database’s transaction log.

When you use it with SQL Server, it shares the same transaction log reader that Transactional Replication (TR) does. If you enable either CDC or TR, a log reader is started. If you have both enabled, they use a single log reader.

A key upside of using a log reader is that it doesn’t slow down the initial updates to the target table. The changes are read asynchronously, separately.

Until recently, though, you could not use CDC with Azure SQL Database. The log reader agent ran from within SQL Server Agent, and with Azure SQL Database, you didn’t have a SQL Server Agent.

The product team have recently done the work to make CDC work with Azure SQL Database. It is an interesting option for extracting changes from a database, so this is the first blog post in a series of posts about using CDC with Azure SQL Database. Links to other posts will be added here as they are available: