SQL: Why use Change Data Capture for Azure SQL Database?

I often need to capture the changes from one database into another. The most common reason is that I'm wanting to bring changes from a transactional system across into a data warehouse that's part of a BI setup.

So which technology is best to use for this?

That's not a trivial question to answer but here are some thoughts:

Replication?

Unfortunately, this one's not available for Azure SQL DB as yet. Azure SQL DB can be a subscriber in Transactional Replication. We often use it this way. If we have an on-premises SQL Server, one of our favourite ways to get data into the cloud is by using Transactional Replication. (If you need to get your head around Replication with SQL Server, just head to our course here).

There are many advantages to replication, including the lack of impact on the source system, however Azure SQL DB can't currently be a publisher, so it doesn't help here.

And other forms of replication aren't really useful here, or an available option. So if the source DB is an Azure SQL DB, we need to find something else.

Azure SQL Data Sync

Azure SQL Data Sync is an odd technology. It basically grew out of Merge Replication based ideas. It's not built on Merge Replication, but it's very similar in concept. It was in a preview state so long, and the team had so long since stopped posting information about it, that most of us never thought it would ever reach GA.

You create a setup similar to this:

The sync metadata lives in a DB in Azure, and a copy of the DB that you want to sync is created as an Azure SQL DB. The Azure Data Sync engine then synchronizes the data between the HUB and the other DBs. If any of the DBs are on-premises, then an on-premises agent does the work.

Azure Data Sync (like Merge Replication) is trigger-based. Triggers are used to capture the changes ready for synchronization.

I wasn't a fan or Merge, and I can't say I'm a great fan of Azure SQL Data Sync. While it's conceptually simple, you would not want to use it for anything except very low volume applications.

Change Tracking

Change Tracking is another technology that's come directly from SQL Server land. When it's enabled, a set of change tracking tables are created. As data is changed in the tables of interest, changes are recorded in the change tracking tables.

One positive aspect of Change Tracking is that it isn't based on triggers and it outperforms trigger-based solutions. There are two downsides:

  • The changes are written synchronously, and in the context of the transaction that writes the change to the tracked table. This can impact the performance of the changes to the tracked table i.e. usually two writes are happening for each one that would have happened.
  • You don't get to see all the changes, and not in the order that they happened. Change Tracking lets you know which rows have changed, based upon the table's primary key. You can also ask to have a summary of which columns were changed). This can be a challenge for dealing with referential integrity, and other issues.

Queues (and Service Broker)

Another interesting option is to write to a queue. With an on-premises SQL Server, we can use Service Broker. If you haven't seen Service Broker, it's a transacted queue that lives inside the database. (To learn about this, look here).

With SQL CLR code or with External Activation for Service Broker, we could write to other types of queue like RabbitMQ.

At the current time, Azure SQL Database doesn't currently support writing to external queues. However, I do expect to see this change, as so many people have voted to have this capability added.

Change Data Capture

Change Data Capture (CDC) is another technology direct from SQL Server land. CDC is based on reading changes from a database's transaction log.

When you use it with SQL Server, it shares the same transaction log reader that Transactional Replication (TR) does. If you enable either CDC or TR, a log reader is started. If you have both enabled, they use a single log reader.

A key upside of using a log reader is that it doesn't slow down the initial updates to the target table. The changes are read asynchronously, separately.

Until recently, though, you could not use CDC with Azure SQL Database. The log reader agent ran from within SQL Server Agent, and with Azure SQL Database, you didn't have a SQL Server Agent.

The product team have recently done the work to make CDC work with Azure SQL Database.  It is an interesting option for extracting changes from a database, so this is the first blog post in a series of posts about using CDC with Azure SQL Database. Links to other posts will be added here as they are available:

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database

 

 

Book: Implementing Power BI in the Enterprise

It's been a while coming, but my latest book is now out. Implementing Power BI in the Enterprise is now available in both paperback and eBook. The eBook versions are available in all Amazon stores, and also through most book distributors through Ingram Spark distribution.

I've had a few people ask about DRM-free ePub and PDF versions. While the Kindle version on Amazon is their normal DRM setup, you can purchase the DRM free version directly from us here:

https://sqldownunder.thrivecart.com/implementing-power-bi-ent-ebook/

It contains both the ePub and PDF versions.

Book Details

Power BI is an amazing tool. It's so easy to get started with and to develop a proof of concept. Enterprises want more than that. They need to create analytics using professional techniques.

There are many ways that you can do this but in this book, I've described how I implement these projects.  And it's gone well for many years over many projects.

If you want a book on building better visualizations in Power BI, this is not the book for you.

Instead, this book will teach you about architecture, identity and security, building a supporting data warehouse, using DevOps and project management tools, learning to use Azure Data Factory and source control with your projects.

It also describes how I implements projects for clients with differing levels of cloud tolerance, from the cloud natives, to cloud friendlies, to cloud conservatives, and to those clients who are not cloud friendly at all.

I also had a few people ask about the table of contents. The chapters are here:

  • Power BI Cloud Implementation Models
  • Other Tools That I Often Use
  • Working with Identity
  • Do you need a Data Warehouse?
  • Implementing the Data Model Schema
  • Implementing the Analytics Schema
  • Using DevOps for Project Management and Deployment
  • Staging, Loading and Transforming Data
  • Implementing ELT and Processing
  • Implementing the Tabular Model
  • Using Advanced Tabular Model Techniques
  • Connecting Power BI and Creating Reports

I hope you enjoy it.

SQL Day 2021 is on, and I'd love to see you in my Power BI pre-con

One of my favourite conferences each year is SQL Day. It's run by an enthusiastic group from Poland, and when I've attended in person, I loved it. This year it's virtual, and the upside of that, is you can attend from anywhere.

As part of the conference, I'm running a pre-con workshop. It's a low cost one day course on How I Implement Power BI in Enterprises. You'll find info on it here. The course is running on Poland time, but it looks to me like the times will suit a pretty wide variety of people, including from here in Australia.

More info here:

I'd love to see you there.

 

ADF: Where did "discard all changes" go in Azure Data Factory?

I'm a big fan of Azure Data Factory (ADF), but one of the things you need to get used to with tools like this, is that the UI keeps changing over time. That makes it hard for several reasons:

  • It's hard to train people. Any recorded demo you have will show them things that no longer exist, within a fairly short period of time.
  • Every time a cosmetic change occurs, it immediately devalues blog posts, tutorials, etc. that are out on the Internet.

I think Microsoft don't quite get how much blog posts, etc. supplement their own documentation.

It's the same with Power BI. I used to often teach a Power BI class on a Tuesday, as the second day of our week-long BI Core Skills class. And Tuesday was the day that Power BI changes would appear. So I'd be showing a class how to do something, and suddenly I'm trying to find where a tool I use regularly went.

So even people who work with these tools all the time, keep having odd moments where they go to click something they've used for ages, and it's just not there any more.

I had one of these moments the other day in ADF. I went to click on the menu item for "Discard all changes" and it had vanished. My mistake is that I kept looking through the menus and wondering where it went. I didn't notice that it had become a trash-can icon in the top-right of the screen.

So this is just a post for anyone else who's wondering where it went. (Or for someone following a tutorial or blog post and can't find the menu item).

BI: Can you explain where your analytic data came from?

I've seen many challenges with analytics over the years. One that's painful is an inability to explain where analytic data came from. Someone looks at a report, sees a value, and says I don't believe that number. Don't put yourself in that position !
Lineage
 
I load analytics from data warehouses. Most of my data warehouses are SQL Server databases of some type. Currently, they're almost always Azure SQL Databases. I like to include information in the database, about how the data got there i.e. the lineage of the data.
 

How can I record the lineage?

Most analytic data that I work with gets loaded/processed in batches. Sometimes it's overnight. Other times it's every few minutes. But either way, there's a process that's run, and it puts the data in place.
 
Each time the process runs, I put details in a table about the process run. And I put the key for that table into every row updated in that run.
 

What goes in the lineage table?

 
You might choose different values, but at a bare minimum I'd suggest:
When the process ran – I need to know when the data came in.
 
Which process – I need to know which process loaded and updated the data.
Source system – Which system/database/file/other source did the data come from?
 
Process identity – I need to know the identity that the process used when querying the source data. (Different identities might return different data from the source)
At least when someone asks about a row of data, you could at least say that it came from the XYZ source system at 12:24PM on May 12th 2021, and it was loaded and transformed by the PPPP package running as UUUU. You have a chance of establishing the validity of the data.

Opinion: Reports aren't tables

I love the way that tools like Power BI have made creating analytics approachable for a much wider range of users. The skills acquired when doing this are a great start for using analytics in enterprises, but an area that I still see missing is data modelling.

What I see users doing is this:

  • Decide what the output reports, etc. should look like
  • Design tables that match the reports.

And this is where things go wrong.

Reports aren't tables

Take an example where I have customers who have accounts. A report on customers seems to match pretty well with a table i.e. it has some sort of identifier for the customer, their name, their address, etc.

But then the user decides that they need an accounts table. They look at an accounts report, and see that it has things in it like the name of the customer, or their address. And so they model an accounts table with attributes of the customer, as well as attributes of the account.

The same thing happens with the customer. The customers report might have had some items that are account-related. So those attributes get added to the customers table as well.

You can imagine how this evolves. It's well-intentioned; it lets them get data quickly; but for an enterprise? It's the start of an inglorious mess.

If you're starting to create tables for enterprise analytics, you really need to learn about data modelling. Don't assume that tables are just copies of what's needed in a report.

BI: DataWeek starting soon – don't miss it

I was really excited today to see details of the upcoming #dataweek. Anything with a name like that already has my interest.

In this case though, there's a #CreateData event that's happening as part of #dataweek. And the focus of that looks to be on Azure Synapse Analytics.

Azure Synapse Analytics

I know that many of my readers won't have spent much time with Azure Synapse Analytics. The product has grown out of Azure SQL Data Warehouse (note not Azure SQL Database) but it's now much more than just an upgraded Azure SQL DW.

It's a tool that combines data warehousing with Big Data analytics. So many of the early Big Data offerings struck me as just Big Hype. I wrote about that back in 2013. And early last year, I wrote about my concern that about how "modern" seems to be treated as a synonym for "better", but it certainly isn't. And Big Data was an example for me, in particular Hadoop.

My real concern was that most of these tools were only addressing a small part of what was needed.

Why Azure Synapse Analytics ?

One key advantage of Azure Synapse Analytics is that it tackles both the data warehousing and Big Data needs, in a single product. That's far more useful. And it's trying to do it without a lot of moving data around, as that introduces latency.

I spent the last two weeks on a DevOps project focussing on building and deploying Azure Synapse Analytics projects. I was impressed by how mature the tooling now was, compared to when I'd last looked at it.

If you're not across where this product is at, I'd encourage you to consider registering for #dataweek.

ADF: Time zone support in Data Factory – a Small Change but so Important

I work with a lot of technology on a daily basis, much of it amazing. But I still get excited when relatively small enhancements are made, and they make my life or development much better.

Timezone support in schedule triggers in Azure Data Factory is one of those changes.

Schedule Triggers

In Data Factory, the most common type of trigger that we use is a schedule trigger. It's used to run a pipeline at a specific time or series of times. But one of the most painful aspects of these triggers is that they didn't have time zone support.

That meant that for a town like Melbourne or Sydney where daylight savings time applied, I couldn't set a pipeline to just run at say 2AM every day, Melbourne time. Every six months, I'd have to go in and change the trigger times to achieve that.

I'd also have to endlessly convert between our local times and UTC as that's the time that would be entered.

If you only have a few triggers, that might be OK, but when you have a lot of them, well that's painful.

Design was Ready

When I looked at the JSON representation of the trigger, there was a time zone in there:

So I knew that whoever designed it meant for it to work. But ADF wouldn't support any other value than UTC.

Great News !

In the last few days, we got the great news that time zones are now supported in these triggers.

You can see the option now in the UI:

I can't tell you how pleased I am to see that there.

Thank you ADF team !

 

Power BI: 5 Minutes to "Wow" and for enterprises, what's next?

Power BI is amazing. And it's starting to appear all over the place. Many enterprises don't know what to make of it though. Some are scared that it'll be the "next Access" where stores of uncontrolled data end up all over the organization. Power BI's mantra of "5 minutes to Wow" is spot on. It's easy to be impressed. But enterprises are often struggling with "what comes next after that 5 minutes?"

We've been implementing many enterprise-level projects that use Power BI, and I'd love to spend time showing you what we do.

Amit Bansal and the DPS team in India are hosting a virtual summit later this year. There are three days of data-related conference presentations delivered 24 x 7. If you can't find something of real interest to you, and in your time zone, you aren't looking. And the price? At present it's $89 USD. Yes you read that right. And it includes a year of access to the recordings of all the normal conference sessions.

As part of their Data Platform Virtual Summit  event, there are also pre-cons and post-cons. Again they are amazing value.

I'm presenting a pre-con over two half days. At the current discount, it's $129 USD and if you use the discount code GREG@DPS, it's even cheaper. I'd love to see many of you attend. Let's talk about Power BI in the enterprise.

Now, this isn't a session about how to make great visuals, etc. I'll leave that to someone else. But if you want to really get into what makes a good underlying data model, how to integrate properly with security, how to structure projects, how we use Azure Analysis Services and/or Power BI Premium and more, this should be the right session for you.

You can book here.

Fix: Unexpected error in Analysis Services Power Query designer in Visual Studio 2019

I was editing using the Power Query editor in an Analysis Services project, hosted in Visual Studio 2019. When I tried to use "Add column by example", I received the error shown above:

Unexpected Error

Could not load file or assembly 'Microsoft.DataIntegration.TransformDataByExample, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad34e35' or one of its dependencies. The system cannot find the file specified.

The problem is that the assembly had not been deployed with the Analysis Services Designer.

Fortunately, the same assembly is used by Power BI Desktop. Because I had that installed on the same machine, I was able to copy the assembly:

Microsoft.DataIntegration.TransformDataByExample.dll

from the folder:

C:\Program Files\Microsoft Power BI Desktop\bin

to the same folder as the devenv.exe program from VS 2019:

C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\

If that path doesn't exist on your machine, it might be here instead if you're using the Community edition:

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\

After that, it was back working again. I hope that helps someone else.