Azure Data Factory (ADF) – Issues with parsing strings in CSVs

It's scary how much of the world's data lives in CSVs. Yet, as a standard, it's problematic. So many implementations just don't work as expected. Today, I had a client asking about why Azure Data Factory wouldn't read a CSV that, while odd, was in a valid format.

The simplest equivalent of the file that wouldn't load would be one like this:
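1,Classic Widget,Red,Standard stock item
2,Premium Widget,Blue,Standard stock item
3,"Widget, Deluxe",Green,Ships separately
4,"The ""Special"" widget, limited run",Gold,Ships separately

(That's illustrative data rather than the client's actual file, but it has the same shape.)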

There are meant to be four columns. The source system wrapped strings in quotes only when the string contained a comma, as that was the delimiter for the file.

But the fourth line in that file would not load. ADF would split it across multiple columns. It was not interpreting the doubled double-quotes (the standard CSV way of escaping a quote inside a quoted value).

I was puzzled at first, as the file seemed OK. If I uploaded that file to any of the common online CSV validators or linters, they all said it was fine.

The client had configured the dataset as follows:

The options seemed ok to them.

Turns out the issue was with the escape character. ADF defaulted to using a backslash. For the majority of CSV files that I see, it needs to be a double-quote instead.
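In the dataset JSON, that's the escapeChar property. Here's a trimmed-down sketch of the DelimitedText settings with it set to a double-quote (the other values shown are just typical defaults):

{
    "type": "DelimitedText",
    "typeProperties": {
        "columnDelimiter": ",",
        "quoteChar": "\"",
        "escapeChar": "\"",
        "firstRowAsHeader": true
    }
}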

And sure enough, with that change in place, it loaded files similar to the one above without issue.

I don't know why that's not the default for CSV files in ADF but I do hope it helps someone else.

ADF: Passing a file attachment via a POST to a REST service using Azure Data Factory

I was recently doing work for a client where I needed to replace some Python code that was being used to upload a file attachment to a REST based service using a POST command. I was trying to do it using a Web Activity in Azure Data Factory. What was surprising was how little information I could find on how to do that.

While I can't post all the code, I did want to post the core parts here in case it helps others, and for my own future reference. The basic steps involve:

  • Put the file into a storage account
  • Read the file into a variable
  • Construct the body of the POST call, including the attachment
  • Send the POST call using a Web Activity
  • Test the REST-based call

Put the file into a storage account

Ideally, the Web Activity would have a way to just point to a file in an Azure Storage Account and get it to attach that file to calls. It doesn't have that option.

What it does have is an option to include a dataset in the body of the call.

Add dataset UI

I was immediately hopeful and tried that, and while it was interesting, it didn't include the data as an attachment. It just appended it into the body of the call.

So let's load up a file into a variable and use the variable to provide part of the body of the web call.

First, I created a resource group called rest_post_test using the Azure portal, then created a storage account called greglowfileupload. I created a text file called cities.csv with the following contents:

List of cities
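The exact contents don't matter; it was just a short list of city names, something like:

Canberra
Sydney
Melbourne
Brisbane
Perth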

And I uploaded it to a container called uploadfiles.

Cities file in a storage container

Read the file into a variable

I created an Azure pipeline called RestUploadTest, and added the following parameters:

Pipeline parameters

The APIKey was provided by the REST service and is used to log on, instead of using other forms of authentication.

The SiteURL is the address that the POST is being sent to. I've used a test site that I'll talk about in a moment.

The ContainerName, FolderName, and FileName are the location of the file in the storage account. (It is not in a subfolder)

All of these values would normally be read from a database, but for simplicity, I've made them parameters to the pipeline.

I then added the following variables:

Pipeline variables
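Roughly, there were three String variables (the names here are illustrative; they're the ones I'll use below):

AttachmentFileContents (String)
WebRequestContentType (String)
WebRequestBody (String)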

Next, I added a linked service called StorageFile to connect to the storage account.

Linked service definition

Note that I would normally use a Key Vault for holding key details for the storage account, but I've gone for simplicity in this demo.

To read the file from storage, we can use a Lookup Activity, but it requires a Dataset, so we create the dataset first. I called it StorageTextFile.  I like to make datasets reusable, so I added parameters to it:

Dataset parameters

Then I configured those parameters as dynamic values in the Connection tab:

Dataset connection tab

Dataset connection tab 2
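The expressions are just references to the dataset's own parameters, roughly like this (assuming the dataset parameters mirror the pipeline parameter names):

Container: @dataset().ContainerName
Directory: @dataset().FolderName
File name: @dataset().FileName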

Note that because I used a Delimited Text File as the dataset type, I had to configure a column and row delimiter. I chose values that will not occur in my file. I used the preview option to make sure I could read the data.

In the pipeline, I added a Lookup Activity called Get File Contents, configured the timeout as 20 minutes (not the default of 12 hours!) and set the details as follows:

Get file contents activity

The dataset properties were populated dynamically from the pipeline parameters.
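In other words, each dataset property was just an expression over the matching pipeline parameter, along these lines:

ContainerName: @pipeline().parameters.ContainerName
FolderName: @pipeline().parameters.FolderName
FileName: @pipeline().parameters.FileName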

Finally, I added a Set Variable Activity called Set AttachmentFileContents.

Set attachment file contents

The dynamic code for setting the variable was essentially this:
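@activity('Get File Contents').output.firstRow.Prop_0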

I used the output of the Get File Contents activity and got the first row from it. Because I didn't use the first row as a header row, ADF allocated its own column names. In this case, I have one column called Prop_0.

I then tested the pipeline up to this point. I clicked on the output of the Set AttachmentFileContents activity and could see the values. (They're JSON-encoded in this viewer.)

First test ok

All good to here. We have the file loaded into a variable.

Construct the body of the POST call, including the attachment

The next two things I need to configure are the variables that are holding the content type and body for the web request. I put them in variables separately to make it easier to debug.

First, I configured the web request content type variable:

Web request content type

The value was along these lines (the boundary string itself is arbitrary, so the one here is just a placeholder):
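multipart/form-data; boundary=MyTestBoundary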

This is multipart form data, and because the body can contain many values, you need a boundary that's used to separate them.

Then the body values:

Web request body

The value was made up of the standard multipart/form-data sections, along these lines (the form field name here is a placeholder, and the attachment content comes from the variable set earlier):
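--MyTestBoundary
Content-Disposition: form-data; name="file"; filename="cities.csv"
Content-Type: text/csv

(the contents of the AttachmentFileContents variable)
--MyTestBoundary--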

Note: the positioning of the carriage return/linefeeds in this is critical. You can't just concat them by using '\r' or '\n'. I've found that doesn't work. The line breaks need to be in the dynamic values.

Also note: you'll see the boundary value in here. Where the boundary first appears, it has a double-dash prefix; at the end, it has both a double-dash prefix and a double-dash suffix.

Now we have the pieces ready for the Web Activity.

Note on Line Endings

I found this works for most web servers, but some web servers are fussy about how new lines are formed. Some are happy to just get a line feed (LF) character, others want a carriage return and line feed (CRLF).

If you only supply a LF and it wants a CRLF, the error returned is normally one that says you've reached the end of the stream unexpectedly (or similar words). In that case, you'll need to adjust the body to have consistent line endings with CRLF.

The replace() function in ADF works fine, but you can't just specify '\n' or '\r', etc. I've found you need to use the base64 encodings to get what's needed. (Suggestion from fellow MVP Simon Sabin)

They are as follows:

  • Instead of '\r' write base64ToString('DQ==')
  • Instead of '\n' write base64ToString('Cg==')
  • Instead of '\r\n' write base64ToString('DQo=')
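For example, to make sure the body ends up with CRLF line endings throughout, you can collapse any existing CRLF pairs down to LF and then expand every LF back to CRLF. A sketch, assuming the body is in a variable called WebRequestBody:

@replace(replace(variables('WebRequestBody'), base64ToString('DQo='), base64ToString('Cg==')), base64ToString('Cg=='), base64ToString('DQo='))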

Send the POST call using a Web Activity

I then added a Web Activity called Upload the file attachment:

Web activity
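The settings that matter are roughly these (again assuming my illustrative variable names from earlier):

URL: @pipeline().parameters.SiteURL
Method: POST
Headers:
    Content-Type: @variables('WebRequestContentType')
    Accept: application/json
    Authorization: (see the note below)
Body: @variables('WebRequestBody')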

Note that the Authorization header contained an expression along these lines:
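@concat('Bearer ', pipeline().parameters.APIKey)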

That supplies the API key as the bearer.

Note also: you shouldn't need the Accept header but I've found some services complain if it isn't there. The other reason for having it set to application/json is so that when an error occurs, you can read it without having to decode it as you would with gzip compression, etc.

Uploading Larger Files

If you are uploading larger files, there are two things to consider:

  • I'm not sure what the current maximum size is for an ADF variable, but the method that I've used assumes that the whole file will fit in a variable. The limit is pretty large though.
  • In the advanced settings for the Web Activity, there is a setting for an HTTP Request Timeout. That defaults to 1 minute. If the upload will take longer than that, you'll need to increase it. It has a 10-minute maximum value.

HTTP request timeout

Test the REST-based call

Now we get to test it. What we need is something to catch the requests while we're testing. The easiest site I've found for this is webhook.site. You can set up a site that listens on a URL for free. It's awesome. When you start it, it gives you a temporary URL to use:

That's the URL that I used earlier as a pipeline parameter. Let's test the POST by debugging the pipeline.

First, we make sure the pipeline ran as expected:

When we check the website, we see the request received, and with the file attachment!

Post received

Lower-level debugging

While this site is great when debugging most things, at first I struggled with it. The problem was that when I had malformed requests, I wished that it would just show me the raw data that it was receiving. I needed to compare my requests to other well-formed requests. I asked the authors and it can't do that. (At least not right now)

What did help is that I found a GitHub project from Janne Mattila, called Echo. You'll find it here: https://github.com/JanneMattila/Echo. It looks like Janne also has a hosted version of it available here.

I decided to build that project and deploy it to an Azure Website, in case I wanted to modify it. And it was so very helpful. It just shows you the detail of everything it receives, rather than formatting the request:

Echo output

Thanks, Janne!

I know this has been a longer blog post than normal but I hope it helps someone work this out.

 

ADF: Replacing carriage returns in Data Factory expression language

This one had me stumped lately.

I had a variable in ADF where the value contained carriage returns. I needed to remove them.

I guessed that would be easy:

@replace(variables('fileContent'), '\r', '')

But no matter what variation of this I tried, it wouldn't remove them.

Thanks to the genius of fellow MVP Simon Sabin, the answer was:

@replace(variables('fileContent'), base64ToString('DQ=='), '')

Note: if you need a line feed (LF) it's base64ToString('Cg==') and a carriage return / line feed pair (CRLF), is base64ToString('DQo=').

Thanks Simon, and I hope this helps someone else.

FIX: Data Factory ODBC linked service fails to Apply and returns Internal Server Error

I was working with a client who was having trouble debugging an ADF pipeline, related to an ODBC linked service that wasn't working as expected.

The user had configured the connection string property of an ODBC connection this way:

  •  Added a parameter called ServiceDSN to the linked service
  •  Configured the connection string as @concat('DSN=',linkedService().ServiceDSN)

The Test Connection option for that worked fine, but when they clicked Apply, it failed with an Internal Server Error. Specifically, the error was:

Failed to encrypt linked service credentials on linked self-hosted IR 'XXX-XXXX01' through service bus, reason is: InternalServerError, error message is: Internal Server Error

Absolutely no idea why. Has to be a bug.

Other Symptoms

What it had done, though, was leave the linked service connection details in a weird state. Trying to access it via the Test Connection option at the dataset level showed Data source name too long.

What finally gave me a clue is that when I looked at the error message in the log on the shared integration runtime, it actually said Data source name too long not found in mapping in connector. Apart from the lousy English in that message, it's interesting that only the first part of it surfaced in the UI. The additional not found part was a major hint: it wasn't finding the entry for the linked service, for the interactive mode used during debugging in the portal.

Solution

Anyway, the solution was to configure the connection string as DSN=@{linkedService().ServiceDSN} instead. That tests just the same, but doesn't throw the internal server error when you try to apply it. And it works as expected.

No idea why the way you construct the connection string matters, as they both return the same string, but it does. Both methods test fine, but one explodes when you try to apply it.

Another Related Error

One other thing I saw periodically during testing was an error that said:

Format of the initialization string does not conform to specification starting at index 0

This error occurs if the connection string happens to contain just the DSN value, without the DSN= prefix.

Hope any/all of these help someone else.

ADF: Use MSIs not SQL Logins whenever possible

Azure Data Factory (ADF) is great for moving data around. It often needs to connect to SQL databases. If you're creating linked services for this, please try to avoid using SQL logins. Don't use usernames and passwords for authentication if you can avoid them.
 
Azure SQL Database lets you create a user from a data factory's managed service identity (MSI). MSIs are a special type of Service Principal (SP). They provide an identity where you don't need to manage the ID or any sort of password. Azure does that for you.
 
Each ADF exposes an Azure-based identity on its property page. Here is the main menu for my BeanPerfection data factory:

Data factory properties page

You might have guessed you should go to the Managed identities menu option, but that is for User Managed identities. I'll write more about them another day. The identity we're interested in is a System Managed Identity.

On this properties page, it's actually called the Managed Identity Application ID. Like all other types of Service Principal, Managed Service Identities have an ApplicationID.

Creating a user in Azure SQL Database

I create a user in an Azure SQL Database with the FROM EXTERNAL PROVIDER option. This says that Azure SQL isn't performing the authentication.

The user name is the full name of the ADF. In Azure SQL Database, we can just use the name of the data factory rather than its Application ID; Azure SQL works out the identity for us.
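The statements are along these lines (using my data factory's name in lower-case; substitute your own, and whatever role name suits you):

CREATE USER [beanperfectiondatafactory] FROM EXTERNAL PROVIDER;

CREATE ROLE datafactory_access;
ALTER ROLE datafactory_access ADD MEMBER [beanperfectiondatafactory];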

I always make this data factory name lower-case. Once the user is created, I add it to whatever role in the database makes sense. In my example above, I used the role name datafactory_access but there's nothing special about that name.

You need to decide which role to add it to based upon the tasks that the data factory needs to perform. While it's temptingly easy to just add it to db_owner, try to resist that. If in doubt, create a role that's used for data factory access and grant it permissions as they are needed.

Must execute using an AAD-Based Connection

If you just try to execute the statements above, you might find that you get an error message saying:

Msg 33159, Level 16, State 1, Line 8
Principal 'beanperfectiondatafactory' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.

You cannot create any type of Azure AD-based user in an Azure SQL Database if your connection was authenticated as a SQL login. You must use a connection that was itself made using Azure AD authentication.

I find that the easiest way to do that is to make sure I have an Azure Active Directory admin assigned for my Azure SQL server, and then just execute the code right in the Azure portal. I use the Query editor tab in the main menu for the Azure SQL Database and connect as that administrator.

 

ADF: Where did "discard all changes" go in Azure Data Factory?

I'm a big fan of Azure Data Factory (ADF), but one of the things you need to get used to with tools like this is that the UI keeps changing over time. That makes it hard for several reasons:

  • It's hard to train people. Within a fairly short period of time, any recorded demo you have will show them things that no longer exist.
  • Every time a cosmetic change occurs, it immediately devalues blog posts, tutorials, etc. that are out on the Internet.

I think Microsoft don't quite get how much blog posts, etc. supplement their own documentation.

It's the same with Power BI. I used to often teach a Power BI class on a Tuesday, as the second day of our week-long BI Core Skills class. And Tuesday was the day that Power BI changes would appear. So I'd be showing a class how to do something, and suddenly I'm trying to find where a tool I use regularly went.

So even people who work with these tools all the time, keep having odd moments where they go to click something they've used for ages, and it's just not there any more.

I had one of these moments the other day in ADF. I went to click on the menu item for "Discard all changes" and it had vanished. My mistake is that I kept looking through the menus and wondering where it went. I didn't notice that it had become a trash-can icon in the top-right of the screen.

So this is just a post for anyone else who's wondering where it went. (Or for someone following a tutorial or blog post and can't find the menu item).

SDU Podcast #82 with guest Kamil Nowinski now released

I recorded another podcast with an old friend recently. I met Kamil Nowinski in Poland when I was invited to speak at SQL Day. That event is on again this year, but as a virtual event only.

In our consulting work, we use Azure Data Factory frequently. Deploying it can be a real challenge. Microsoft supplies a method based on ARM templates, but that has issues. Kamil produced free tooling that's in the Azure Marketplace and makes it much easier to work with.

In this podcast, Kamil describes the overall process, how the challenges arise with the Microsoft-supplied mechanism, and the way his tooling makes it easier.

If you want to understand more about Azure Data Factory (ADF) and how it interacts with source control, and deployment, this is a great show to listen to.

You'll find the podcast with Kamil (and all our other podcasts) here: https://podcast.sqldownunder.com

ADF: Keep linked service names the same across environments for Azure Data Factory

I recently recorded a podcast with Kamil Nowinski. Kamil is well known for creating tools that make it easy to publish individual Azure Data Factory pipelines. It's a far better method than the standard ARM (Azure Resource Manager) template deployment that's provided by Microsoft.

However, if you use this method (and I recommend you do), you'll be publishing pipelines separately from the linked services that your datasets connect to.

One mistake I've seen a few clients make lately is that when they created the linked services in different environments, they did so with slightly different names. Because of that, they needed to modify the deployed objects after deployment. You don't want to be doing that.

Data Factory datasets use the name of the linked service to identify it, so it's really important to keep linked service names consistent across the different environments where you'll be deploying pipelines and datasets.

Even though the linked services will be pointing to different servers, etc. in different environments, keeping the names of the linked services consistent will make your life much easier.

 

BI: Azure Data Factory Copy Activities Won't Start (Queued)

I love working with Azure Data Factory (ADF). There are times though, when I'm stuck wondering what's going on, and it takes me a moment to discover it.

One that had me puzzled recently was a Copy activity that just wouldn't seem to start. The pipeline had started OK, and previous activities in the pipeline had run just fine. But when it got to the Copy activity, it said "Queued" and kept saying that, seemingly endlessly.

Integration Runtimes

The first thing to understand is that these Copy activities access datasets that are connected via linked services:

Linked services

If you look at the properties of the linked services that you're copying to and from, you'll see that they in turn are accessed using Integration Runtimes.

Integration Runtime

Concurrency

The integration runtimes have concurrency limits.

Integration Runtime Properties

So one thing that could cause your Copy activity to queue is the number of concurrent jobs exceeding this limit.

However, in the case I'm talking about, this was the only Copy activity running.

Linked Shared Integration Runtimes

Integration Runtimes can be shared between data factories. The trick in this case was that the integration runtime was linked to one shared by another data factory. In fact, 9 data factories were linked to this same shared integration runtime.

What I should have checked was the Monitor for the linked shared integration runtime:

Integration Runtime Monitor

That would have shown the total of what was going on:

Integration Runtime Count

Previously, you couldn't see the totals from a data factory that was linking. You needed to check from the data factory that was sharing.

The Cause

My Copy activity seemed to be queued for no reason. But of course there was a reason. Another data factory had a bug and was consuming all the available concurrency from the shared integration runtime.

 

ADF: Time zone support in Data Factory – a Small Change but so Important

I work with a lot of technology on a daily basis, much of it amazing. But I still get excited when relatively small enhancements are made, and they make my life or development much better.

Time zone support in schedule triggers in Azure Data Factory is one of those changes.

Schedule Triggers

In Data Factory, the most common type of trigger that we use is a schedule trigger. It's used to run a pipeline at a specific time or series of times. But one of the most painful aspects of these triggers was that they didn't have time zone support.

That meant that for a city like Melbourne or Sydney, where daylight saving time applies, I couldn't set a pipeline to just run at, say, 2AM every day, Melbourne time. Every six months, I'd have to go in and change the trigger times to achieve that.

I'd also have to endlessly convert between our local times and UTC, as UTC was what had to be entered.

If you only have a few triggers, that might be OK, but when you have a lot of them, well that's painful.

Design was Ready

When I looked at the JSON representation of the trigger, there was already a time zone property in there. Trimmed down (and with a placeholder trigger name and start time), it looks like this:
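{
    "name": "DailyLoadTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-02-01T02:00:00Z",
                "timeZone": "UTC"
            }
        }
    }
}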

So I knew that whoever designed it meant for it to work. But ADF wouldn't accept any value other than UTC.

Great News !

In the last few days, we got the great news that time zones are now supported in these triggers.

You can see the option now in the UI:

I can't tell you how pleased I am to see that there.

Thank you, ADF team!