FIX: Data Factory ODBC linked service fails to Apply and returns Internal Server Error

I was working with a client who has having trouble debugging an ADF pipeline, related to an ODBC linked service not working as expected.

The user had configured the connection string property of an ODBC connection this way:

  •  Set a parameter to the linked service as ServiceDSN
  •  Configured the connection string as @concat('DSN=',linkedService().ServiceDSN)

The Test Connection for that worked fine, but when you click Apply, it fails with an Internal Server Error. Specifically, the error was:

Failed to encrypt linked service credentials on linked self-hosted IR 'XXX-XXXX01' through service bus, reason is: InternalServerError, error message is: Internal Server Error

Absolutely no idea why. Has to be a bug.

Other Symptoms

What it had done though, is leave the linked service connection details in a weird state. Trying to access it via the Test Connection option at the dataset level, showed Data source name too long.

What finally gave me a clue, is that when I looked at the error message in the log on the shared integration runtime, it actually said Data source name too long not found in mapping in connector. Now apart from the lousy English on this one, it's interesting that in the UI, only the first part of the message surfaced. The additional not found part was a major hint. It wasn't finding the entry for the linked service, for the interactive mode used during debugging in the portal.

Solution

Anyway, the solution was to configure the connection string as DSN=@{linkedService().ServiceDSN} instead. That tests just the same, but doesn't throw the internal server error when you try to apply it. And it works as expected.

No idea why the way you construct the connection string matters, as they both return the same string, but it does. Both methods test fine, but one explodes when you try to apply it.

Another Related Error

One other thing I saw periodically during testing was an error that said:

Format of the initialization string does not conform to specification starting at index 0

This error occurs if the connection string happens to just contain the DSN and not the string with the DSN= prefix.

Hope any/all of these help someone else.

Cosmos Down Under podcast 7 with guest Rodrigo Souza is now published!

I was able to record another new Cosmos Down Under podcast today. My guest was Microsoft Senior Program Manager Rodrigo Souza.

In the show, we discussed the Change Data Capture feed for the Analytical store in Azure Cosmos DB. This is a powerful new capability and worth learning about.

I hope you enjoy the show.

https://podcast.cosmosdownunder.com/

 

SQL: Understanding Change Data Capture for Azure SQL Database – Part 2 – How does it work?

In the part 1 of this series, I discussed the positioning of Change Data Capture. In part 2, I want to cover how it works.

Log Reading

There are many ways that you can output details of changes that occur in data within SQL Server. Many of those methods require actions to occur at the time the data change is made. This can be problematic.

The first problem with this, is the performance impact on the application that's making the change. If I update a row in a table and there is part of the process that writes details of that change to some type of audit or tracking log, I've now increased the work that needs to happen in the context of the application that's making the change. Generally what this means, is that I've slowed the application down by at least doubling the work that needs to be performed. That might not be well-received.

The second potential problem is even nastier. What if the change tracking part of the work fails even though the original update works? If I've done the change tracking work in the context of the original update (particularly if it's done as part of an atomic process), by adding tracking I might have broken the original application. That certainly wouldn't be well-received.

So what to do?

The best answer seems to be to work with the transaction log that's already built into SQL Server. By default, it does have details of the changes that have been occurring to the data. It can be read asynchronously so delays in reading it mostly won't affect the original data changes at all (there are only rare exceptions to this). If the reading of the logs failed, the problem can be corrected and the reading can be restarted, all again without affecting the main updates that are occurring.

And that's what Change Data Capture does. It uses the same log reader agent that has been part of SQL Server for a very long time. Previously though, it was used for Transactional Replication. In fact if you use both Transactional Replication and Change Data Capture on the same SQL Server system, they share the same instance of the log reader. The SQL Server Agent is used to make them both work.

SQL Server Agent – Isn't that missing?

When we're working with Azure SQL Database, things are a bit different. Currently, we don't have any concept of Transactional Replication. That could change but right now, it's not there. So sharing the log reader isn't an issue.

But I also mentioned that with SQL Server, it was the SQL Server Agent that kicks off the log reading agent. And with Azure SQL Database, we don't have SQL Server Agent either !

The Azure SQL Database team have instead provided a scheduler that runs the log reader (called the capture), and also runs the required clean-up tasks. SQL Server had another agent to perform clean-up. This is all automated and requires no maintenance from the user.

Change Data Capture (CDC) Data Flow

The data flow with CDC is basically like the following:

CDC Data Flow

  1. The original app sends a change (insert, update, delete) to a table in the Azure SQL Database.
  2. The change is recorded in the transaction log.
  3. Some time later (usually not long though), the change is read by the capture process and stored in a change table.
  4. The Data Warehouse (DW) or target system or ETL system makes a call to a set of CDC functions to retrieve the changes.

Everything in the dotted box above is part of, and contained within, the Azure SQL Database.

Upcoming

In the next section, I'll show you the code that's required, and show you the changes that occur to the Azure SQL Database when you enable CDC.

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database

 

SQL: Understanding Change Data Capture for Azure SQL Database – Part 1 – Why?

I often need to capture the changes from one database into another. The most common reason is that I'm wanting to bring changes from a transactional system across into a data warehouse that's part of a BI setup.

So which technology is best to use for this?

That's not a trivial question to answer but here are some thoughts:

Replication?

Unfortunately, this one's not available for Azure SQL DB as yet. Azure SQL DB can be a subscriber in Transactional Replication. We often use it this way. If we have an on-premises SQL Server, one of our favourite ways to get data into the cloud is by using Transactional Replication. (If you need to get your head around Replication with SQL Server, just head to our course here).

There are many advantages to replication, including the lack of impact on the source system, however Azure SQL DB can't currently be a publisher, so it doesn't help here.

And other forms of replication aren't really useful here, or an available option. So if the source DB is an Azure SQL DB, we need to find something else.

Azure SQL Data Sync

Azure SQL Data Sync is an odd technology. It basically grew out of Merge Replication based ideas. It's not built on Merge Replication, but it's very similar in concept. It was in a preview state so long, and the team had so long since stopped posting information about it, that most of us never thought it would ever reach GA.

You create a setup similar to this:

The sync metadata lives in a DB in Azure, and a copy of the DB that you want to sync is created as an Azure SQL DB. The Azure Data Sync engine then synchronizes the data between the HUB and the other DBs. If any of the DBs are on-premises, then an on-premises agent does the work.

Azure Data Sync (like Merge Replication) is trigger-based. Triggers are used to capture the changes ready for synchronization.

I wasn't a fan of Merge, and I can't say I'm a great fan of Azure SQL Data Sync. While it's conceptually simple, you would not want to use it for anything except very low volume applications.

Change Tracking

Change Tracking is another technology that's come directly from SQL Server land. When it's enabled, a set of change tracking tables are created. As data is changed in the tables of interest, changes are recorded in the change tracking tables.

One positive aspect of Change Tracking is that it isn't based on triggers and it outperforms trigger-based solutions. There are two downsides:

  • The changes are written synchronously, and in the context of the transaction that writes the change to the tracked table. This can impact the performance of the changes to the tracked table i.e. usually two writes are happening for each one that would have happened.
  • You don't get to see all the changes, and not in the order that they happened. Change Tracking lets you know which rows have changed, based upon the table's primary key. You can also ask to have a summary of which columns were changed). This can be a challenge for dealing with referential integrity, and other issues.

Queues (and Service Broker)

Another interesting option is to write to a queue. With an on-premises SQL Server, we can use Service Broker. If you haven't seen Service Broker, it's a transacted queue that lives inside the database. (To learn about this, look here).

With SQL CLR code or with External Activation for Service Broker, we could write to other types of queue like RabbitMQ.

At the current time, Azure SQL Database doesn't currently support writing to external queues. However, I do expect to see this change, as so many people have voted to have this capability added.

Change Data Capture

Change Data Capture (CDC) is another technology direct from SQL Server land. CDC is based on reading changes from a database's transaction log.

When you use it with SQL Server, it shares the same transaction log reader that Transactional Replication (TR) does. If you enable either CDC or TR, a log reader is started. If you have both enabled, they use a single log reader.

A key upside of using a log reader is that it doesn't slow down the initial updates to the target table. The changes are read asynchronously, separately.

Until recently, though, you could not use CDC with Azure SQL Database. The log reader agent ran from within SQL Server Agent, and with Azure SQL Database, you didn't have a SQL Server Agent.

The product team have recently done the work to make CDC work with Azure SQL Database.  It is an interesting option for extracting changes from a database, so this is the first blog post in a series of posts about using CDC with Azure SQL Database. Links to other posts will be added here as they are available:

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database

 

 

Data Science summit 2022 – Warsaw (and Hybrid) – SQL Server 2022 T-SQL

I'm always excited when I can get involved in conferences with our Polish friends.

Coming up very soon is the Data Science Summit 2022: https://dssconf.pl/en/

For this summit, I'll be presenting a quick (around 40 minutes) session highlighting what's changed in T-SQL for SQL Server 2022. I'm always so glad to see T-SQL enhancements in SQL Server and SQL Server 2022 has more than what we've seen in other recent versions. There are a number of very important enhancements that will take a little while to get our heads around, on the best way to use them.

I've also seen the list of people presenting and the range of topics for the conference, and it really looks quite fascinating. There is content in Polish but the majority is in English so it's completely accessible for us English speakers.

I'd really love to see as many of you as possible attending, to support the Polish data community.

SDU Tools v22 is now available (finally)

One of our popular free resources is the SDU Tools library. If you haven't checked it out, I'd encourage you to do so. It's a large library of functions, procedures, and views all written in native T-SQL code.

You can easily use it as a complete library, or use it as examples of how to write T-SQL code. v22 is now available for download.

If you aren't on our notification list, you can add yourself here:

https://sdutools.sqldownunder.com

I'm sorry it's taken a while longer to get this version out than I would have liked, but we've finally caught up after a really busy period.

In v22, we've added the following:

  • Calculate Age in Months – think it's easy to work out ages? Just use DATEDIFF? If you think so, you'd be mistaken. DATEDIFF could even tell you that someone who is one day old is a month old. That's not true where I live. This function fixes that.
  • Last SQL Server Restart (work out when SQL Server was last restarted)
  • Languages (this is a great list of all the world's languages, categorized into their language families, includes the name of the language in English, and in the native language, and has the two and three character ISO codes for them.
  • List Constraints with System Names (you know you should name constraints explicitly and not let the system do it. This tool helps you find the ones that have slipped through)
  • File Path To File Extension, File Path To File Name, File Path To Folder Path – these functions take a full file path and extract out the extension, the file name, and the folder. They work with local file paths, UNC paths, and with Windows and Unix path delimiters.
  • List Untrusted Check Constraints, List Untrusted Foreign Keys – e previously provided tools to attempt to retrust untrusted check and foreign key constraints, but we realised we did not have a tool to find them in the first place. One new tool is used to list any check constraints that are currently untrusted. The other new tool does the same for untrusted foreign keys.

We've also done the usual updates to lists of SQL Server versions and builds, and a few patches to existing tools.

We hope it's useful.

Book: Implementing Power BI in the Enterprise

It's been a while coming, but my latest book is now out. Implementing Power BI in the Enterprise is now available in both paperback and eBook. The eBook versions are available in all Amazon stores, and also through most book distributors through Ingram Spark distribution.

I've had a few people ask about DRM-free ePub and PDF versions. While the Kindle version on Amazon is their normal DRM setup, you can purchase the DRM free version directly from us here:

https://sqldownunder.thrivecart.com/implementing-power-bi-ent-ebook/

It contains both the ePub and PDF versions.

Book Details

Power BI is an amazing tool. It's so easy to get started with and to develop a proof of concept. Enterprises want more than that. They need to create analytics using professional techniques.

There are many ways that you can do this but in this book, I've described how I implement these projects.  And it's gone well for many years over many projects.

If you want a book on building better visualizations in Power BI, this is not the book for you.

Instead, this book will teach you about architecture, identity and security, building a supporting data warehouse, using DevOps and project management tools, learning to use Azure Data Factory and source control with your projects.

It also describes how I implements projects for clients with differing levels of cloud tolerance, from the cloud natives, to cloud friendlies, to cloud conservatives, and to those clients who are not cloud friendly at all.

I also had a few people ask about the table of contents. The chapters are here:

  • Power BI Cloud Implementation Models
  • Other Tools That I Often Use
  • Working with Identity
  • Do you need a Data Warehouse?
  • Implementing the Data Model Schema
  • Implementing the Analytics Schema
  • Using DevOps for Project Management and Deployment
  • Staging, Loading and Transforming Data
  • Implementing ELT and Processing
  • Implementing the Tabular Model
  • Using Advanced Tabular Model Techniques
  • Connecting Power BI and Creating Reports

I hope you enjoy it.

SQL Day 2021 is on, and I'd love to see you in my Power BI pre-con

One of my favourite conferences each year is SQL Day. It's run by an enthusiastic group from Poland, and when I've attended in person, I loved it. This year it's virtual, and the upside of that, is you can attend from anywhere.

As part of the conference, I'm running a pre-con workshop. It's a low cost one day course on How I Implement Power BI in Enterprises. You'll find info on it here. The course is running on Poland time, but it looks to me like the times will suit a pretty wide variety of people, including from here in Australia.

More info here:

I'd love to see you there.

 

ADF: Where did "discard all changes" go in Azure Data Factory?

I'm a big fan of Azure Data Factory (ADF), but one of the things you need to get used to with tools like this, is that the UI keeps changing over time. That makes it hard for several reasons:

  • It's hard to train people. Any recorded demo you have will show them things that no longer exist, within a fairly short period of time.
  • Every time a cosmetic change occurs, it immediately devalues blog posts, tutorials, etc. that are out on the Internet.

I think Microsoft don't quite get how much blog posts, etc. supplement their own documentation.

It's the same with Power BI. I used to often teach a Power BI class on a Tuesday, as the second day of our week-long BI Core Skills class. And Tuesday was the day that Power BI changes would appear. So I'd be showing a class how to do something, and suddenly I'm trying to find where a tool I use regularly went.

So even people who work with these tools all the time, keep having odd moments where they go to click something they've used for ages, and it's just not there any more.

I had one of these moments the other day in ADF. I went to click on the menu item for "Discard all changes" and it had vanished. My mistake is that I kept looking through the menus and wondering where it went. I didn't notice that it had become a trash-can icon in the top-right of the screen.

So this is just a post for anyone else who's wondering where it went. (Or for someone following a tutorial or blog post and can't find the menu item).

BI: Can you explain where your analytic data came from?

I've seen many challenges with analytics over the years. One that's painful is an inability to explain where analytic data came from. Someone looks at a report, sees a value, and says I don't believe that number. Don't put yourself in that position !
Lineage
 
I load analytics from data warehouses. Most of my data warehouses are SQL Server databases of some type. Currently, they're almost always Azure SQL Databases. I like to include information in the database, about how the data got there i.e. the lineage of the data.
 

How can I record the lineage?

Most analytic data that I work with gets loaded/processed in batches. Sometimes it's overnight. Other times it's every few minutes. But either way, there's a process that's run, and it puts the data in place.
 
Each time the process runs, I put details in a table about the process run. And I put the key for that table into every row updated in that run.
 

What goes in the lineage table?

 
You might choose different values, but at a bare minimum I'd suggest:
When the process ran – I need to know when the data came in.
 
Which process – I need to know which process loaded and updated the data.
Source system – Which system/database/file/other source did the data come from?
 
Process identity – I need to know the identity that the process used when querying the source data. (Different identities might return different data from the source)
At least when someone asks about a row of data, you could at least say that it came from the XYZ source system at 12:24PM on May 12th 2021, and it was loaded and transformed by the PPPP package running as UUUU. You have a chance of establishing the validity of the data.