ADF: Use MSIs not SQL Logins whenever possible

Azure Data Factory (ADF) is great for moving data around, and it often needs to connect to SQL databases. If you're creating linked services for this, try to avoid SQL logins i.e. avoid authenticating with usernames and passwords wherever you can.
 
Azure SQL Database lets you create a user from a data factory's managed service identity (MSI). MSIs are a special type of Service Principal (SP). They provide an identity where you don't need to manage the ID or any sort of password; Azure does that for you.
 
Each ADF exposes an Azure-based identity on its property page. Here is the main menu for my BeanPerfection data factory:
[Image: Data factory properties page]
You might have guessed you should go to the Managed identities menu option, but that is for User Managed identities. I'll write more about them another day. The identity we're interested in is a System Managed Identity.
On this properties page, it's actually called the Managed Identity Application ID. Like all other types of Service Principal, Managed Service Identities have an Application ID.

Creating a user in Azure SQL Database

I create a user in an Azure SQL Database with the FROM EXTERNAL PROVIDER option. This says that Azure SQL isn't performing the authentication.

The user name is the full name of the ADF. In Azure SQL Database, we can just use the data factory's name rather than its Application ID; Azure SQL maps the name to the identity for us.
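
As a rough sketch, using the names from my example (beanperfectiondatafactory for the data factory, and datafactory_access for the role I mention below), the statements look something like this:

    -- Create a role for data factory access (any sensible name will do)
    CREATE ROLE datafactory_access;

    -- Create the user from the data factory's managed identity;
    -- the user name is just the data factory's name, and Azure AD
    -- (not Azure SQL) performs the authentication
    CREATE USER [beanperfectiondatafactory] FROM EXTERNAL PROVIDER;

    -- Add the new user to the role
    ALTER ROLE datafactory_access ADD MEMBER [beanperfectiondatafactory];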

I always make this data factory name lower-case. Once the user is created, I add it to whatever role in the database makes sense. In my example above, I used the role name datafactory_access, but there's nothing special about that name.

You need to decide which role to add it to based upon the tasks that the data factory needs to perform. While it's temptingly easy to just add it to db_owner, try to resist that. If in doubt, create a role that's used for data factory access and grant it permissions as they are needed.

Must execute using an AAD-Based Connection

If you just try to execute the statements above, you might find that you get an error message saying:

Msg 33159, Level 16, State 1, Line 8
Principal 'beanperfectiondatafactory' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.

You cannot create any type of Azure AD-based user in an Azure SQL Database if your connection was authenticated as a SQL login. You must use a connection that was itself made using Azure AD authentication.

I find that the easiest way to do that is to make sure I have an Azure Active Directory admin assigned for my Azure SQL Server, and then just execute the code right in the Azure Portal. I use the Query Editor tab in the main menu for the Azure SQL Database and connect as that administrator.

 

ADF: Where did "discard all changes" go in Azure Data Factory?

I'm a big fan of Azure Data Factory (ADF), but one of the things you need to get used to with tools like this is that the UI keeps changing over time. That makes it hard for several reasons:

  • It's hard to train people. Within a fairly short period of time, any recorded demo you have will show them things that no longer exist.
  • Every time a cosmetic change occurs, it immediately devalues blog posts, tutorials, etc. that are out on the Internet.

I think Microsoft don't quite get how much blog posts, etc. supplement their own documentation.

It's the same with Power BI. I used to often teach a Power BI class on a Tuesday, as the second day of our week-long BI Core Skills class. And Tuesday was the day that Power BI changes would appear. So I'd be showing a class how to do something, and suddenly I'd be trying to find where a tool I used regularly had gone.

So even people who work with these tools all the time keep having odd moments where they go to click something they've used for ages, and it's just not there any more.

I had one of these moments the other day in ADF. I went to click on the menu item for "Discard all changes" and it had vanished. My mistake was that I kept looking through the menus and wondering where it went. I didn't notice that it had become a trash-can icon in the top-right of the screen.

So this is just a post for anyone else who's wondering where it went. (Or for anyone following a tutorial or blog post who can't find the menu item.)

SDU Podcast #82 with guest Kamil Nowinski now released

I recorded another podcast with an old friend recently. I met Kamil Nowinski in Poland when I was invited to speak at SQL Day. That event is on again this year, but as a virtual event only.

In our consulting work, we use Azure Data Factory frequently. Deploying it can be a real challenge. Microsoft supplies a method based on ARM templates, but that has issues. Kamil produced free tooling that's in the Azure Marketplace and makes it much easier to work with.

In this podcast, Kamil describes the overall process, how the challenges arise with the Microsoft-supplied mechanism, and the way his tooling makes things easier.

If you want to understand more about Azure Data Factory (ADF) and how it interacts with source control and deployment, this is a great show to listen to.

You'll find the podcast with Kamil (and all our other podcasts) here: https://podcast.sqldownunder.com

ADF: Keep linked service names the same across environments for Azure Data Factory

I recently recorded a podcast with Kamil Nowinski. Kamil is well-known for creating some tools that make it easy to publish individual Azure Data Factory pipelines. It's a far better method than the standard ARM (Azure Resource Manager) template deployment that's provided by Microsoft.

However, if you use this method (and I recommend you do), you'll be publishing pipelines separately from the linked services that your datasets connect to.

One mistake I've seen a few clients making lately is that when they created the linked services in different environments, they did so with slightly different names. Because of that, they needed to modify the deployed objects after deployment. You don't want to be doing that.

Data Factory datasets identify their linked service by name, so it's really important that when you create your linked services, you keep their names consistent across the different environments where you'll be deploying pipelines and datasets.
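
As a rough illustration (the dataset, table, and linked service names here are made up), a dataset definition references its linked service by name, and it's that referenceName value that needs to match in every environment:

    {
        "name": "SalesOrders",
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": {
                "referenceName": "AzureSqlDb_Sales",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "tableName": "dbo.SalesOrders"
            }
        }
    }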

Even though the linked services will point to different servers, etc. in different environments, keeping their names consistent will make your life much easier.

 

BI: Azure Data Factory Copy Activities Won't Start (Queued)

I love working with Azure Data Factory (ADF). There are times though, when I'm stuck wondering what's going on, and it takes me a moment to discover it.

One that had me puzzled recently was a Copy activity that just wouldn't seem to start. The pipeline had started OK. Previous activities in the pipeline had run just fine. But when it got to the Copy activity, it said "Queued" and just kept saying that, seemingly endlessly.

Integration Runtimes

The first thing to understand is that these Copy activities access datasets, and those datasets connect to their data stores via linked services:

[Image: Linked services]

If you look at the properties of the linked services that we're copying to/from, you'll see that they, in turn, are accessed using Integration Runtimes.

[Image: Integration Runtime]
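
As a sketch (the linked service and runtime names here are made up), the linked service's JSON points at its integration runtime through the connectVia property:

    {
        "name": "OnPremSqlServer",
        "properties": {
            "type": "SqlServer",
            "typeProperties": {
                "connectionString": "..."
            },
            "connectVia": {
                "referenceName": "SharedSelfHostedIR",
                "type": "IntegrationRuntimeReference"
            }
        }
    }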

Concurrency

The integration runtimes have concurrency limits.

[Image: Integration Runtime Properties]

So one thing that could cause your Copy activity to queue is the number of concurrent jobs exceeding this limit.

However, in the case I'm talking about, this was the only Copy activity running.

Linked Shared Integration Runtimes

Integration Runtimes can be shared between data factories. The catch in this case was that the integration runtime was linked to one shared by another data factory. In fact, 9 data factories were linked to this same shared integration runtime.

What I should have checked was the Monitor for the linked shared integration runtime:

[Image: Integration Runtime Monitor]

That would have shown the total of what was going on across all of those data factories:

[Image: Integration Runtime Count]

Previously, you couldn't see those totals from a data factory that was linking to the runtime. You needed to check from the data factory that was sharing it.

The Cause

My Copy activity seemed to be queued for no reason. But of course there was a reason. Another data factory had a bug and was consuming all the available concurrency from the shared integration runtime.

 

ADF: Time zone support in Data Factory – a Small Change but so Important

I work with a lot of technology on a daily basis, much of it amazing. But I still get excited when relatively small enhancements are made, and they make my life or development much better.

Time zone support in schedule triggers in Azure Data Factory is one of those changes.

Schedule Triggers

In Data Factory, the most common type of trigger that we use is a schedule trigger. It's used to run a pipeline at a specific time or series of times. But one of the most painful aspects of these triggers was that they didn't have time zone support.

That meant that for a city like Melbourne or Sydney, where daylight saving time applies, I couldn't set a pipeline to just run at, say, 2 AM every day, Melbourne time. Every six months, I'd have to go in and change the trigger times to achieve that.

I'd also have to endlessly convert between our local times and UTC, as that's what had to be entered.

If you only have a few triggers, that might be OK, but when you have a lot of them, well that's painful.

Design was Ready

When I looked at the JSON representation of the trigger, there was a time zone in there:
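
It looked something like this (the trigger name and times are illustrative, and I've left out the pipeline reference):

    {
        "name": "DailyLoadTrigger",
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": "Day",
                    "interval": 1,
                    "startTime": "2021-02-01T02:00:00Z",
                    "timeZone": "UTC"
                }
            }
        }
    }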

So I knew that whoever designed it meant for it to work. But ADF wouldn't accept any value other than UTC.

Great News!

In the last few days, we got the great news that time zones are now supported in these triggers.
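
For example, as I understand the new behaviour (check the trigger UI or documentation for the exact list of supported time zone names), the recurrence section can now specify a named time zone, with the start time expressed in that zone:

    "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-02-01T02:00:00",
        "timeZone": "AUS Eastern Standard Time"
    }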

You can see the option now in the UI:

I can't tell you how pleased I am to see that there.

Thank you, ADF team!

 

ADF: Changing or removing your Git connection from an Azure Data Factory

I've been working a lot with ADF (Azure Data Factory) again lately. If you've started to work with ADF and aren't using Git in conjunction with it, you need to change that now.

ADF objects are also represented as JSON objects and lend themselves nicely to being stored in Git.

Another key advantage relates to saving your work. If you don't have Git connected, there's no Save button when you're working in ADF. If you're part way through making changes, you can either Publish them or Discard All. When you have Git in place, you also get the option to Save.

NOTE: If I'm working at a site that doesn't have Git connected and I need to stop working on something temporarily, what I can do is get the JSON code for the pipeline (or whatever I'm working on), save it away manually, and then discard my changes. At least I can then put it back again later. But that's basically what the Git option does for you anyway, so just use it.

Changing or Removing the Git Connection

Today was the first time I'd needed to change the Git connection across to another repository. And it took me a while to work out how. You'll notice that in the main editing window, there's no option to change it:

Now I think it should be there. But if not, it should be on the Properties page for the ADF instance. But it's not there either:

Now I think it should be in one of those two locations.

So where is it?

In one of the least obvious bits of UI ever created, it's tucked up in the top-right corner of the Dashboard. That's the last place I would have looked. That page is usually useless to me. It's great if you've just started, as it has lots of links to tutorials, etc., but otherwise it's not so interesting. But that's precisely where the option is:

I hope that helps someone.