In part 1 of this series, I discussed the positioning of Change Data Capture. In part 2, I want to cover how it works.
There are many ways that you can output details of changes that occur in data within SQL Server. Many of those methods require actions to occur at the time the data change is made. This can be problematic.
The first problem with this is the performance impact on the application that's making the change. If I update a row in a table and part of the process writes details of that change to some type of audit or tracking log, I've now increased the work that needs to happen in the context of the application that's making the change. Generally what this means is that I've slowed the application down by at least doubling the work that needs to be performed. That might not be well-received.
The second potential problem is even nastier. What if the change tracking part of the work fails even though the original update works? If I've done the change tracking work in the context of the original update (particularly if it's done as part of an atomic process), by adding tracking I might have broken the original application. That certainly wouldn't be well-received.
So what to do?
The best answer seems to be to work with the transaction log that's already built into SQL Server. It already holds details of the changes that have been occurring to the data. It can be read asynchronously, so delays in reading it mostly won't affect the original data changes at all (there are only rare exceptions to this). If the reading of the log fails, the problem can be corrected and the reading restarted, all again without affecting the main updates that are occurring.
And that's what Change Data Capture does. It uses the same log reader agent that has been part of SQL Server for a very long time. Previously though, it was used for Transactional Replication. In fact if you use both Transactional Replication and Change Data Capture on the same SQL Server system, they share the same instance of the log reader. The SQL Server Agent is used to make them both work.
SQL Server Agent – Isn't that missing?
When we're working with Azure SQL Database, things are a bit different. Currently, we don't have any concept of Transactional Replication. That could change but right now, it's not there. So sharing the log reader isn't an issue.
But I also mentioned that with SQL Server, it was the SQL Server Agent that kicked off the log reading agent. And with Azure SQL Database, we don't have SQL Server Agent either!
The Azure SQL Database team have instead provided a scheduler that runs the log reader (known as the capture job), and also runs the required clean-up tasks. SQL Server used a separate agent job to perform clean-up. In Azure SQL Database, this is all automated and requires no maintenance from the user.
Change Data Capture (CDC) Data Flow
The data flow with CDC is basically like the following:
The original app sends a change (insert, update, delete) to a table in the Azure SQL Database.
The change is recorded in the transaction log.
Some time later (usually not long though), the change is read by the capture process and stored in a change table.
The Data Warehouse (DW) or target system or ETL system makes a call to a set of CDC functions to retrieve the changes.
Everything in that flow, apart from the original application and the target system, is part of, and contained within, the Azure SQL Database.
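To give a feel for that last step, here's a minimal sketch of the kind of call involved, assuming CDC has already been enabled for a hypothetical dbo.Customers table (with the default capture instance name dbo_Customers):

```sql
-- Find the LSN (log sequence number) range we want to read, for a
-- hypothetical dbo.Customers table (capture instance dbo_Customers)
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customers');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- Retrieve all changes (inserts, updates, deletes) in that range
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customers(@from_lsn, @to_lsn, N'all');
```

In practice, the ETL process would save the last LSN it processed, and use that as the starting point for the next extraction.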
In the next section, I'll show you the code that's required, and show you the changes that occur to the Azure SQL Database when you enable CDC.
I often need to capture the changes from one database into another. The most common reason is that I want to bring changes from a transactional system across into a data warehouse that's part of a BI setup.
So which technology is best to use for this?
That's not a trivial question to answer but here are some thoughts:
Transactional Replication
Unfortunately, this one's not available with Azure SQL DB as the source as yet. Azure SQL DB can be a subscriber in Transactional Replication, and we often use it that way. If we have an on-premises SQL Server, one of our favourite ways to get data into the cloud is by using Transactional Replication. (If you need to get your head around Replication with SQL Server, just head to our course here.)
There are many advantages to replication, including the lack of impact on the source system, however Azure SQL DB can't currently be a publisher, so it doesn't help here.
And other forms of replication aren't really useful here, or an available option. So if the source DB is an Azure SQL DB, we need to find something else.
Azure SQL Data Sync
Azure SQL Data Sync is an odd technology. It basically grew out of Merge Replication based ideas. It's not built on Merge Replication, but it's very similar in concept. It was in a preview state for so long, and the team had stopped posting information about it for so long, that most of us thought it would never reach GA.
You create a setup similar to this:
The sync metadata lives in a DB in Azure, and a copy of the DB that you want to sync is created as an Azure SQL DB. The Azure Data Sync engine then synchronizes the data between the HUB and the other DBs. If any of the DBs are on-premises, then an on-premises agent does the work.
Azure Data Sync (like Merge Replication) is trigger-based. Triggers are used to capture the changes ready for synchronization.
I wasn't a fan of Merge, and I can't say I'm a great fan of Azure SQL Data Sync. While it's conceptually simple, you would not want to use it for anything except very low volume applications.
Change Tracking
Change Tracking is another technology that's come directly from SQL Server land. When it's enabled, a set of change tracking tables are created. As data is changed in the tables of interest, changes are recorded in the change tracking tables.
One positive aspect of Change Tracking is that it isn't based on triggers and it outperforms trigger-based solutions. There are two downsides:
The changes are written synchronously, in the context of the transaction that writes the change to the tracked table. This can impact the performance of changes to the tracked table, i.e., usually two writes happen for each one that would have happened.
You don't get to see all the changes, nor the order in which they happened. Change Tracking lets you know which rows have changed, based upon the table's primary key. (You can also ask for a summary of which columns were changed.) This can be a challenge for dealing with referential integrity, and other issues.
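As a rough sketch of what using Change Tracking looks like (the database and dbo.Customers table here are hypothetical examples):

```sql
-- Enable change tracking at the database level
ALTER DATABASE SalesDB
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable it for a table, optionally tracking which columns changed
ALTER TABLE dbo.Customers
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- Later, ask which rows have changed since a previous sync version
DECLARE @last_sync bigint = 0;  -- normally saved from the previous run
SELECT ct.CustomerID, ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_COLUMNS
FROM CHANGETABLE(CHANGES dbo.Customers, @last_sync) AS ct;
```

Note that what comes back is the keys of the changed rows and the type of operation, not the changed data itself, which is exactly the limitation described above.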
Queues (and Service Broker)
Another interesting option is to write to a queue. With an on-premises SQL Server, we can use Service Broker. If you haven't seen Service Broker, it's a transacted queue that lives inside the database. (To learn about this, look here).
With SQL CLR code or with External Activation for Service Broker, we could write to other types of queue like RabbitMQ.
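To give a flavour of what a Service Broker based approach involves, here's a minimal sketch (all the object names are hypothetical examples):

```sql
-- Minimal Service Broker objects: message type, contract, queue, service
CREATE MESSAGE TYPE ChangeMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT ChangeContract (ChangeMessage SENT BY INITIATOR);
CREATE QUEUE dbo.ChangeQueue;
CREATE SERVICE ChangeService ON QUEUE dbo.ChangeQueue (ChangeContract);
GO

-- Send a change notification onto the queue; because the queue is
-- transacted, this commits or rolls back with the original change
DECLARE @h uniqueidentifier;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE ChangeService
    TO SERVICE 'ChangeService'
    ON CONTRACT ChangeContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @h
    MESSAGE TYPE ChangeMessage (N'<change table="Customers" id="12"/>');
```

The transacted nature is the appeal here, but note that the SEND still happens in the context of the original transaction, so it shares the synchronous-write downside mentioned for triggers and Change Tracking.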
Azure SQL Database doesn't currently support writing to external queues. However, I do expect to see this change, as so many people have voted to have this capability added.
Change Data Capture
Change Data Capture (CDC) is another technology direct from SQL Server land. CDC is based on reading changes from a database's transaction log.
When you use it with SQL Server, it shares the same transaction log reader that Transactional Replication (TR) does. If you enable either CDC or TR, a log reader is started. If you have both enabled, they use a single log reader.
A key upside of using a log reader is that it doesn't slow down the original updates to the tracked table. The changes are read asynchronously, and separately.
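To give a sense of scale, enabling CDC is just a couple of procedure calls. A minimal sketch, assuming a hypothetical dbo.Customers table:

```sql
-- Enable CDC for the current database (requires appropriate permissions)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for a (hypothetical) table; this creates a change table
-- and the query functions for the capture instance
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customers',
    @role_name     = NULL;   -- no gating role required to read the changes
```

Once enabled, the capture process populates the change table, and the changes can be queried via the generated cdc.fn_cdc_get_all_changes functions.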
Until recently, though, you could not use CDC with Azure SQL Database. The log reader agent ran from within SQL Server Agent, and Azure SQL Database doesn't have SQL Server Agent.
The product team have recently done the work to make CDC work with Azure SQL Database. It is an interesting option for extracting changes from a database, so this is the first blog post in a series of posts about using CDC with Azure SQL Database. Links to other posts will be added here as they are available:
ExcelSerialToDateTime – there's so much corporate data stored in Excel, and if you need to import it to SQL Server, you quickly find that dates and times are a challenge. I have a friend who says that the only data types in Excel are numbers and strings, and that everything else is an illusion. I think that's pretty accurate. This function takes one of the funky Excel decimal numbers (called a Serial) that represents a date and time, and converts it to a SQL Server datetime value.
DateTimeToExcelSerial – no prizes for guessing what this one does. It's the opposite of the above.
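To show the arithmetic these functions are dealing with (this is just an illustration of the underlying conversion, not the implementation in the tools): for dates after 28 Feb 1900, Excel's serial is effectively the number of days since 30 Dec 1899 (Excel wrongly treats 1900 as a leap year), with the time of day as the decimal fraction. SQL Server's datetime uses day 0 = 1 Jan 1900, so subtracting 2 lines the two systems up:

```sql
-- Excel serial 44927.75 represents 2023-01-01 18:00
-- (fractional part .75 = three quarters of a day = 18:00)
DECLARE @serial float = 44927.75;
SELECT CAST(@serial - 2 AS datetime) AS ConvertedValue;
```

Going the other way (datetime to serial) is the same adjustment in reverse.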
We've made a lot of updates to internal data within the tools, like all the up-to-date SQL Server build numbers, etc.
DateDimensionColumns now has StartOfWeek and EndOfWeek
DateDimensionPeriodColumns now has IsSameWeek, DayOffset, MonthOffset, and WeekOffset
I recorded the first of a series of podcasts for the SQL Server 2022 timeframe recently. Once again, it was great to chat with another old friend, Davide Mauri. Davide is a Principal Product Manager for Azure SQL Database.
Note that the topic I chose for today was the new sp_invoke_external_rest_endpoint system stored procedure. It's currently only available for Azure SQL Database, but I expect it will get ported to other flavours of SQL Server over time. Davide has been deeply involved with the development of this procedure.
Developers have wanted a better way to interact with external services from within Azure SQL Database and this procedure really helps with that. I feel it lets the database now integrate much more closely into the Azure ecosystem.
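As a quick taste of the procedure, a minimal sketch (the URL and payload here are hypothetical):

```sql
-- Call an external REST endpoint from Azure SQL Database
-- (hypothetical URL and payload, for illustration only)
DECLARE @response nvarchar(max);

EXEC sp_invoke_external_rest_endpoint
    @url      = N'https://example.azurewebsites.net/api/orders',
    @method   = N'POST',
    @payload  = N'{"orderId": 12, "status": "shipped"}',
    @response = @response OUTPUT;

-- The response (headers, status, and body) comes back as JSON
SELECT @response AS ResponseJson;
```

That ability to reach out to Azure Functions, Event Hubs, and other services directly from T-SQL is what makes the integration story so much better.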
For this summit, I'll be presenting a quick (around 40 minutes) session highlighting what's changed in T-SQL for SQL Server 2022. I'm always so glad to see T-SQL enhancements in SQL Server, and SQL Server 2022 has more of them than we've seen in other recent versions. There are a number of very important enhancements, and it will take a little while to get our heads around the best ways to use them.
I've also seen the list of people presenting and the range of topics for the conference, and it really looks quite fascinating. There is content in Polish but the majority is in English so it's completely accessible for us English speakers.
I'd really love to see as many of you as possible attending, to support the Polish data community.
Charles (an ex-Melbourne local) is working for Microsoft in the Azure Cosmos DB team. His current focus is the PostgreSQL API that Azure Cosmos DB supports.
In this show, Charles discusses the state of the PostgreSQL market, how it's implemented in the Azure Cosmos DB for PostgreSQL API, how that compares to Microsoft's Flexible Server offering for PostgreSQL, distributed tables, scaling databases, and so much more.
I have a varied client base. Many of my customers are traditional larger financial organizations. But I also work with start-ups and software houses. (Microsoft calls these ISVs – Independent Software Vendors).
Most software houses don't want to sell software any more. Instead, they want to repackage their solutions as services, i.e., they want to offer Software as a Service (SaaS).
Converting an application to a service is not easy. Even Microsoft had false starts with moving their on-premises applications to cloud-based offerings.
Lately, I've spent a lot of time considering Microsoft's Well Architected Framework (WAF). It provides pillars of guidance targeted at architectural excellence. There is a specific article that covers Multitenant SaaS on Azure. Part of that discussion is around the use of Azure SQL Database and Elastic Pools. This is an area that's often missed in these discussions, and it's the database aspects of that topic that I want to drill into further in this post.
Supporting SaaS with Databases
Databases are dear to my heart. It's critical to get the database aspects of a SaaS project right, and to do that early on.
On-premises applications are mostly single-tenant. They support a single organization. We do occasionally see multi-tenant databases. They hold the same types of information for many organizations.
But what about SaaS based applications? By default you'll want to store data for many client organizations. Should you create a large single database that holds data for everyone? Should you create a separate database for each client? Or should you create something in-between?
As with most things in computing, there is no one simple answer to this.
Here are the main decision points that I look at:
Isolation
For me, this is the #1 item. You have to decide how important it is to isolate one client's data from others'. Developers always tell me "the app does that".
In multi-tenant databases, you are always only a single WHERE clause away from showing one client's data to another client. Writing clean code with great separation is much harder in multi-tenant databases.
You need to start by considering what would happen if this went wrong. For some ISVs, this would be deeply embarrassing but still manageable. For other ISVs, this would be terminal for the organization.
Imagine the discussion between your CEO and a client after you showed their data to someone else. How would that discussion go? Would you still have a job? Would you still have an organization?
If you have even the slightest doubt about this, use a separate database per client. You still have the chance to mess that up, but you are starting in a better place.
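If you do end up with a multi-tenant database, Row-Level Security can at least act as a safety net underneath that WHERE clause. A minimal sketch, assuming a hypothetical Sales.Orders table with a TenantId column, a Security schema that already exists, and an application that sets the tenant via sp_set_session_context:

```sql
-- Predicate function: a row is visible only when its TenantId matches
-- the tenant stored in the session context by the application
CREATE FUNCTION Security.fn_TenantPredicate (@TenantId int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS AccessAllowed
       WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS int);
GO

-- Bind the predicate to the (hypothetical) table; rows for other
-- tenants are now filtered out even if a query forgets the WHERE clause
CREATE SECURITY POLICY Security.TenantFilter
ADD FILTER PREDICATE Security.fn_TenantPredicate(TenantId)
ON Sales.Orders
WITH (STATE = ON);
```

It's a mitigation rather than a cure: a bug that sets the wrong tenant in the session context still exposes the wrong data, which is why separate databases remain the safer starting point.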
Cost
This one is pretty easy. Single larger databases still cost less than large numbers of smaller databases. If cost is your main issue, you will lean towards single-database designs.
Note: Azure SQL Database offers elastic database pools. Consider them before making your decision.
Cloudiness
A key aspect of "cloudiness" is resource granularity. Having separate databases for each client is much more "cloudy". Multi-database designs have many advantages. Each of your clients might need different levels of performance. Or they might need different features. This is easy in a multi-database design.
You might also be able to add (and on-sell) features to specific clients. For example, one client might want a read-only copy of their data in another location. Other clients don't need that.
Having separate databases lets you decide these things on a client-by-client basis. You can scale the databases up and down to different levels, for each client.
And it can even vary within a single client organization. They might require different environments (i.e.: Production, UAT, Test, etc.). Each environment can have different capabilities and features.
Noisy Neighbors
Everyone who's used a single database to support a large number of clients has run into the "noisy neighbor" situation at some time.
One client can't get their work done because of the overhead from another client. They don't realize they are using the same database as someone else.
These problems are far easier to solve with separate databases.
If your clients are all in a single database, moving one client to another database can be hard. You need to plan for that from day #1.
Query Performance and Complexity
With multi-tenant databases, almost every table will contain the tenant ID. This column will be part of almost every join operation.
It's easy to see why these databases tend to perform worse.
The code you need to write is also more complex.
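As a small illustration of that extra complexity (hypothetical tables):

```sql
-- In a multi-tenant schema, the tenant key appears in every table,
-- every join, and every filter
DECLARE @TenantID int = 12;   -- the current tenant

SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c
    ON c.TenantID = o.TenantID        -- extra join predicate
   AND c.CustomerID = o.CustomerID
WHERE o.TenantID = @TenantID;         -- and an extra filter
```

In a single-tenant database, the TenantID column, join predicate, and filter all disappear, and the indexes get narrower too.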
DevOps and Deployment Friendliness
A multi-database model is much more DevOps friendly. Single smaller databases can often be "spun-up" and then destroyed again, much faster than larger databases.
It is much easier to give each developer their own environment with their own databases during development and testing.
A single large database is obviously easier to manage, but you should be looking to automate the deployment and core management of your databases anyway.
Have you considered what happens when an "oops" moment occurs? One client wants to roll back their data to an earlier point in time. This can be very complex to achieve with single-database designs. It's trivial to achieve with multi-database designs.
What if different clients have different data retention policies?
What if clients want to be able to import/export their own data?
You can do all these things with both designs, but they are much simpler with multiple databases.
Data Ownership
You need to consider who will own the data that's used by your application. When you use a single-database model with multiple tenants, it's clearly going to be a database that you own and are responsible for.
Is that an appropriate risk for your organization?
One model that I'm seeing become more commonplace now is that while the ISV owns and operates the application, the end customers own their own databases (and pay for them). This might simplify your legal situation in regard to data ownership. It might also help if there are data sovereignty issues.
Take legal advice on this.
As I mentioned, this isn't a simple decision. Nowadays, though, for cloud-based PaaS (Platform as a Service) databases supporting SaaS applications, unless there is a compelling reason not to, I'd suggest starting with a separate database for each client every time.
Will is an Engineer for Microsoft working in the Fast Track for Azure team. His focus is on application development and modernization. He is passionate about helping developers from all walks of life build great things using Azure. Prior to joining Microsoft, he worked as a software engineer in the financial and agricultural sectors. He was also a Microsoft MVP in the Data Platform category, focusing on Azure Cosmos DB.
In this show, Will discusses how data is stored in Azure Cosmos DB. In particular, we discussed containers and partitioning.
This month, T-SQL Tuesday is hosted by an old friend of mine, Kevin Kline. He's an old friend because I've known him a long time, not because of age. (There are lots of miles left in Kevin). The invite asked about a conference or event that changed your life or created an opportunity. It could be career related or in another area.
I was invited to join the Microsoft Regional Director program back in the early 2000s. The RD program had grown out of the original DevDays events, and the RDs were the people who spoke at those events. By the time I joined, they were commonly used as speakers at events like TechEd around the world.
In fact, unlike the MVP program, the RD program had an agreement. An MVP award is for what you've done in the previous year. There are no expectations of what you'll do during the year you are awarded, apart from basically not being a horrible person. The RD program had an agreement on what you will do in the upcoming period. Speaking at tier-1 events was expected, and you were also expected to be in the top 10% of speaker ratings at those events.
Tough crowd to join!
TechEd USA events
I was used to technical training, and I spoke at many TechEd and other events around the world. However, at one of the first TechEd events in the USA, I remember signing up for the provided speaker training. I can't tell you the effect that had on me.
Microsoft had set aside a full conference room, with lighting, etc. just for the training, and they had an amazing speaker coach (Richard Klees) who worked with you one-on-one. Interestingly, his entire focus was on "big room" speaking, not on presentations in general. Speaking in big rooms is different. I remember him pointing out that the Rolling Stones don't play coffee shops, and if they did, they would act entirely differently.
I spoke to several other RDs who had attended the training with him before, and did not like him. Ironically, I loved working with him. Others said they found his training far too confronting. They didn't seem to understand that he wasn't there to be your friend. He was there to change just enough things to help you make a better impact, without trying to totally change how you speak. He didn't want to turn you into a basket-case a day or so before your session or sessions.
There were so many things I learned in those sessions. A big issue for me is that I was too softly spoken for a big room. By the time he was happy, I felt like I was pretty much shouting, and he still rated the energy level at about 5 out of 10.
By the time it was finished, I was fairly shattered and completely exhausted. Did it improve my sessions? Absolutely no doubt about it. There are so many things he said that are stuck in my mind still.
TechEd Australia Keynote
I loved speaking at TechEd when it ran. I remember presenting in various locations in the USA, Barcelona, Hong Kong, Australia, New Zealand, and more. Working with the developer evangelists from around the world was a highlight, and of course our local ones like Chuck Sterling, Andrew Coates, and Dave Glover were too.
The 2004 event in Australia was notable for me, because as well as my sessions, I presented part of the keynote with Eric Rudder, a Microsoft senior vice-president of server and tools. I ended up spending quite a bit of time at various events with Eric during that week.
It was fascinating to see the contrast between the social Eric at user events, and the business Eric in the rehearsals for the keynote, where he was razor-sharp focussed on the delivery we were all going to make. People attending those keynotes really had no idea how much work and rehearsing went into delivering them. The section that Terry Clancy and I delivered was pretty smooth. You'd hope so, given the number of times we rehearsed it over many nights. Let's just say that I'm glad I wasn't part of the SharePoint section.
When the event arrived, I remember watching Eric psyching himself up to present, and the level of energy he put into each of us backstage. I was pleased with our section and Eric seemed super happy backstage afterwards.