Book Review: Azure Data Factory Cookbook – 2nd Edition

The people at Packt recently sent me a book to review, and I was happy to do so as it was on a topic that's dear to my heart: Azure Data Factory. The book was the second edition of Azure Data Factory Cookbook, by Dmitry Foshin, Tonya Chernyshova, Dmitry Anoshin, and Xenia Ireton.

Packt

In the past, I wasn't keen on Packt books. When they first appeared, they tended to be low-cost books from unknown authors, many of whom struggled with writing in English, and the editing of the content was pretty poor.

I'm really pleased to see how this has changed in recent times. The authors of most of their books are now people who know their topics and write well in English, and the editing has improved out of sight.

I'm not sure how that was achieved, but I'm really pleased to see that it has.

In terms of production, there are only two comments I'd make:

  • I still find the font style, size, etc. harder to read than in an equivalent book from, say, Apress, particularly for long periods.
  • I know it's hard to ask for colour, but I have to agree with one of the reviewers on Amazon who commented that the lack of colour makes some of the pictures hard to read.

Other than that, the book was large, solid, and well-presented.

Content Style

I like books that are cookbook style. I used to think the same about books on topics like MDX and DAX. There is a place for books that teach the theory, but once people get past the basics, what they often need is a book that just says "if you're trying to achieve this, do this", with a big list of recipes.

This book does that. Most of the topics are covered with walkthroughs that step you through how to do a task. I liked that approach.

Topic Coverage

This book covers a lot of topics. Given that the title of the book was about ADF, I was really surprised to see the breadth of topics that were covered. The subtitle is A data engineer's guide to building and managing ETL and ELT pipelines with data integration, and that gives a clue to the fact that the coverage is much, much broader than ADF.

I was surprised to see so much coverage of pipelines in other places like Synapse Analytics, Fabric, etc. but more surprised to see coverage of HDInsight and big data concepts. I can't remember the last time I saw anyone using HDInsight. I always thought it was seriously over-hyped while it was being promoted, and still think the same way.

It made more sense to see a bunch of coverage of Databricks, delta tables and integrating ADF with Azure Machine Learning, Azure Logic Apps, Azure Functions and more.  They are relatively common areas of integration for ADF, along with migrating on-premises SSIS packages to ADF.

Note: in general, I don't like migrating SSIS packages to ADF in any way except rewriting them. Most of my customers never complain about the cost of using ADF. The only ones I hear complaining are people who use either the SSIS runtime for ADF or those using dataflows in ADF. (I don't like using those either)

Summary

The book is substantial, well written, and comprehensive.

What I really would have liked is more ADF content. I don't want the book to be larger, but for a book with this title, I'd prefer more depth on how to do things in ADF and less on related but ancillary topics.

7 out of 10

ADF: Fix – The required Blob is missing – Azure Data Factory – Get Metadata

I was doing work for a client that uses Azure Data Factory (ADF), and noticed an unexpected Error code 2011 from a Get Metadata activity.

The pipeline was using a Get Metadata activity to read a list of files in a folder in Azure Storage, so that it could use a ForEach activity to process each of the files. Eventually it became clear that the error occurred when the folder was empty.

Storage Folders

Originally with Azure Storage, folders were only a virtual concept and didn't really exist. You couldn't have an empty folder.

So what we used to do back then was to place a file that we called Placeholder.txt into any folder as soon as we created it. That way the folder would stay in place. It's a similar deal to folders in Git, i.e. you can't have a folder without a file.

It also meant that after using a Get Metadata activity to find the list of files in a folder, we'd have to use a Filter activity to exclude any files like our placeholder file that weren't going to get processed.

Hierarchical Namespaces

But that all changed with the introduction of hierarchical namespaces in Azure Storage. Once you enabled this for an account, folders were first class citizens and could exist whether or not they contained files.

This is a good thing as it also made access to files in a folder much faster.

It was part of the support that was needed to use Azure Data Lake Storage (ADLS) Gen 2.

So what was wrong?

The issue in this case was the perception that the placeholder file no longer needed to be there.

Everything else worked, but, by design, if you try to get the metadata for an empty folder, even in a hierarchical namespace, error 2011 is thrown.

That seemed really odd to me.

The fix is in

I talked about this to members of the product group and they gave me the details of what was going wrong.

Even if you've enabled hierarchical namespaces, when you connect to Azure Storage using the Azure Storage Blob connector, you'll still get an error if you try to list the items in an empty folder. Can't say I love that, but it is what it is.

The trick is that you must use the Azure Data Lake Storage (ADLS) Gen 2 connector instead of the Azure Storage Blob connector. If you do that, you get back an empty childItems array, just as expected. No error is thrown.

Modifying existing code

Now, changing the type of connector is a bit painful as you also have to change the type of dataset, and that could mean unpicking it from wherever it's used and then hoping you set it all back up again ok.

Instead, this is what I've done:

  • Add a new linked service for ADLS2
  • Modify the JSON for the existing dataset to change the location property type from AzureBlobStorageLocation to AzureBlobFSLocation, and change the referenceName to the new linked service (see the sketch below)
  • Delete the previous storage linked service.
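
As a rough sketch (the dataset, linked service, and path names here are made up, and the dataset type is just an example), the edited part of the dataset JSON ends up looking something like this:

{
    "name": "IncomingFolderFiles",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AdlsGen2Storage",
            "type": "LinkedServiceReference"
        },
        "type": "Binary",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "datafiles",
                "folderPath": "incoming"
            }
        }
    }
}

Note that AzureBlobFSLocation uses a fileSystem property where AzureBlobStorageLocation used container; the rest of the dataset stays as it was.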

And that should all just work without a bunch of reconfiguring.

I hope that helps someone.

Azure Data Factory (ADF) – Issues with parsing strings in CSVs

It's scary how much of the world's data lives in CSVs. Yet, as a standard, it's problematic. So many implementations just don't work as expected. Today, I had a client asking about why Azure Data Factory wouldn't read a CSV that, while odd, was in a valid format.

The simplest equivalent of the file that wouldn't load, would be one like this:

There are meant to be four columns. The source system wrapped strings in quotes only when the string contained a comma, as that was the delimiter for the file.
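
As a made-up illustration with the same shape (four columns, quotes only where a field contains the delimiter, and a doubled double-quote representing an embedded quote):

ID,Name,City,Comment
1,Anne,Sydney,Works remotely
2,Raj,"Melbourne, VIC",On site
3,Lee,Brisbane,"Known as the ""river city"", apparently"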

But the fourth line in that file would not load. ADF would split it across multiple columns, because it was not interpreting the doubled double-quotes as an escaped quote.

I was puzzled at first, as the file seemed ok. If I uploaded that file to any of the common online CSV validators or linters, they all said it was ok.

The client had configured the dataset as follows:

The options seemed ok to them.

Turns out the issue was with the escape character. ADF defaulted to using a backslash. For the majority of CSV files that I see, it would need to be a double-quote:
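
In the dataset JSON, that's the escapeChar property of a DelimitedText dataset. Here's a minimal sketch (the names and location are made up, and only the relevant properties are shown):

{
    "name": "CsvSourceFile",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "SourceStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "datafiles",
                "fileName": "customers.csv"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "escapeChar": "\"",
            "firstRowAsHeader": true
        }
    }
}

In the UI, the same setting appears as the Escape character option on the dataset's Connection tab.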

And sure enough, with that change in place, it loaded files similar to the one above without issue.

I don't know why that's not the default for CSV files in ADF but I do hope it helps someone else.

ADF: Passing a file attachment via a POST to a REST service using Azure Data Factory

I was recently doing work for a client where I needed to replace some Python code that was being used to upload a file attachment to a REST based service using a POST command. I was trying to do it using a Web Activity in Azure Data Factory. What was surprising was how little information I could find on how to do that.

While I can't post all the code, I did want to post the core parts here in case it helps others, and for my own future reference. The basic steps involve:

  • Put the file into a storage account
  • Read the file into a variable
  • Construct the body of the POST call, including the attachment
  • Send the POST call using a Web Activity
  • Test the REST-based call

Put the file into a storage account

Ideally, the Web Activity would have a way to just point to a file in an Azure Storage Account and get it to attach that file to calls. It doesn't have that option.

What it does have is an option to include a dataset in the body of the call.

Add dataset UI

I was immediately hopeful and tried that, and while it was interesting, it didn't include the data as an attachment. It just appended it into the body of the call.

So let's load up a file into a variable and use the variable to provide part of the body of the web call.

First, I've created a resource group called rest_post_test using the Azure portal, then created a storage account called greglowfileupload. I created a text file called cities.csv with the following contents:

List of cities

And I uploaded it to a container called uploadfiles.

Cities file in a storage container

Read the file into a variable

I created an Azure pipeline called RestUploadTest, and added the following parameters:

Pipeline parameters

The APIKey was provided by the REST service and is used to log on, instead of using other forms of authentication.

The SiteURL is the address that the POST is being sent to. I've used a test site that I'll talk about in a moment.

The ContainerName, FolderName, and FileName are the location of the file in the storage account. (It is not in a subfolder)

All of these values would normally be read from a database, but for simplicity, I've made them parameters to the pipeline.

I then added the following variables:

Pipeline variables

Next I added a linked service StorageFile to connect to the storage account.

Linked service definition

Note that I would normally use a Key Vault for holding key details for the storage account, but I've gone for simplicity in this demo.

To read the file from storage, we can use a Lookup Activity, but it requires a Dataset, so we create the dataset first. I called it StorageTextFile.  I like to make datasets reusable, so I added parameters to it:

Dataset parameters

Then I configured those parameters as dynamic values in the Connection tab:

Dataset connection tab

Dataset connection tab 2

Note that because I used a Delimited Text File as the dataset type, I had to configure a column and row delimiter. I chose values that will not occur in my file. I used the preview option to make sure I could read the data.

In the pipeline, I added a Lookup Activity called Get File Contents, configured the timeout as 20 minutes (not the default of 12 hours!), and set the details as follows:

Get file contents activity

The dataset properties were populated dynamically from the pipeline parameters.

Finally, I added a Set Variable Activity called Set AttachmentFileContents.

Set attachment file contents

The dynamic code for setting the variable was:

I used the output of the Get file contents activity, and got the first row from it. Finally, because I didn't use the first row as a header row, it allocated its own headers. In this case, I have one column called Prop_0.
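
Based on that description, the expression is along these lines (assuming the Lookup activity is named Get File Contents, as above):

@activity('Get File Contents').output.firstRow.Prop_0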

I then tested the code up to this point. I clicked on the output of the Set AttachmentFileContents activity and could see the values (encoded as JSON in this viewer).

First test ok

All good to here. We have the file loaded into a variable.

Construct the body of the POST call, including the attachment

The next two things I needed to configure were the variables holding the content type and the body for the web request. I put them in separate variables to make them easier to debug.

First, I configured the web request content type variable:

Web request content type

The value was actually:

This is multipart form data, and because the body can contain many values, you need a boundary that's used to separate them.
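
As an illustration, the value takes a form like this (the boundary token itself is arbitrary; this one is made up):

multipart/form-data; boundary=ADFFileBoundary1234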

Then the body values:

Web request body

The value was made up of:

Note: the positioning of the carriage return/linefeeds in this is critical. You can't just concat them by using '\r' or '\n'. I've found that doesn't work. The line breaks need to be in the dynamic values.

Also note: you'll see the boundary value in here, but the first time I used it, I added a double-dash prefix, and at the end, I used a double-dash prefix and suffix.
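
Putting that together, the body follows the standard multipart/form-data layout, something like the sketch below. This shows the logical layout rather than the exact dynamic expression; the boundary token and the form-data name ("file") are made up, the name your target service expects may differ, and every line break is a CRLF:

--ADFFileBoundary1234
Content-Disposition: form-data; name="file"; filename="cities.csv"
Content-Type: text/csv

@{variables('AttachmentFileContents')}
--ADFFileBoundary1234--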

Now we have the pieces ready for the Web Activity.

Note on Line Endings

I found this works for most web servers, but some web servers are fussy about how new lines are formed. Some are happy to just get a line feed (LF) character, others want a carriage return and line feed (CRLF).

If you only supply a LF and it wants a CRLF, the error returned is normally one that says you've reached the end of the stream unexpectedly (or similar words). In that case, you'll need to adjust the body to have consistent line endings with CRLF.

The replace() function in ADF works fine, but you can't just specify '\n' or '\r', etc. I've found you need to use the base64 encodings to get what's needed. (Suggestion from fellow MVP Simon Sabin)

They are as follows:

  • Instead of '\r' write base64ToString('DQ==')
  • Instead of '\n' write base64ToString('Cg==')
  • Instead of '\r\n' write base64ToString('DQo=')
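
For example, to normalise a body to CRLF line endings (a sketch, assuming the body is held in a variable called WebRequestBody, which is a made-up name here), collapse any existing CRLF pairs to LF first, and then expand every LF back to CRLF:

@replace(replace(variables('WebRequestBody'), base64ToString('DQo='), base64ToString('Cg==')), base64ToString('Cg=='), base64ToString('DQo='))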

Send the POST call using a Web Activity

I then added a Web Activity called Upload the file attachment:

Web activity

Note the Authorization header contained:

That supplies the API key as the bearer.
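
Based on that, the header value is something like this (using the APIKey pipeline parameter described earlier):

Bearer @{pipeline().parameters.APIKey}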

Note also: you shouldn't need the Accept header but I've found some services complain if it isn't there. The other reason for having it set to application/json is so that when an error occurs, you can read it without having to decode it as you would with gzip compression, etc.

Uploading Larger Files

If you are uploading larger files, there are two things to consider:

  • I'm not sure what the current maximum size is for an ADF variable, but the method that I've used assumes that the whole file will fit in a variable. It's pretty large though.
  • In the advanced settings for the Web Activity, there is a setting for an HTTP Request Timeout. That defaults to 1 minute. If the upload will take longer than that, you'll need to increase it. It has a 10 minute maximum value.

HTTP request timeout

Test the REST-based call

Now we get to test it. What we need is something to catch the requests while we're testing. The easiest site I've found for this is webhook.site. You can set up a site that listens on a URL for free. It's awesome. When you start it, it gives you a temporary URL to use:

That's the URL that I used earlier as a pipeline parameter. Let's test the POST by debugging the pipeline.

First, we make sure the pipeline ran as expected:

When we check the website, we see the request received, and with the file attachment!

Post received

Lower-level debugging

While this site is great when debugging most things, at first I struggled with it. The problem was that when I had malformed requests, I wished that it would just show me the raw data that it was receiving. I needed to compare my requests to other well-formed requests. I asked the authors and it can't do that. (At least not right now)

What did help, is that I found a GitHub project from Janne Mattila, called Echo. You'll find it here: https://github.com/JanneMattila/Echo. It looks like Janne also has a hosted version of it available here.

I decided to build that project and deploy it to an Azure Website, in case I wanted to modify it. And it was so very helpful. It just shows you the detail of everything it receives, rather than formatting the request:

Echo output

Thanks Janne!

I know this has been a longer blog post than normal but I hope it helps someone work this out.

 

ADF: Replacing carriage returns in Data Factory expression language

This one had me stumped recently.

I had a variable in ADF where the value contained carriage returns. I needed to remove them.

I guessed that would be easy:

@replace(variables('fileContent'), '\r', '')

But no matter what variation of this I tried, it wouldn't remove them.

Thanks to the genius of fellow MVP Simon Sabin, the answer was:

@replace(variables('fileContent'), base64ToString('DQ=='), '')

Note: if you need a line feed (LF), it's base64ToString('Cg=='), and a carriage return / line feed pair (CRLF) is base64ToString('DQo=').

Thanks Simon, and I hope this helps someone else.


FIX: Data Factory ODBC linked service fails to Apply and returns Internal Server Error

I was working with a client who was having trouble debugging an ADF pipeline, related to an ODBC linked service not working as expected.

The user had configured the connection string property of an ODBC connection this way:

  • Added a parameter called ServiceDSN to the linked service
  • Configured the connection string as @concat('DSN=',linkedService().ServiceDSN)

The Test Connection for that worked fine, but when you clicked Apply, it failed with an Internal Server Error. Specifically, the error was:

Failed to encrypt linked service credentials on linked self-hosted IR 'XXX-XXXX01' through service bus, reason is: InternalServerError, error message is: Internal Server Error

Absolutely no idea why. Has to be a bug.

Other Symptoms

What it had done, though, was leave the linked service connection details in a weird state. Trying to access it via the Test Connection option at the dataset level showed Data source name too long.

What finally gave me a clue is that when I looked at the error message in the log on the shared integration runtime, it actually said Data source name too long not found in mapping in connector. Now, apart from the lousy English in that one, it's interesting that only the first part of the message surfaced in the UI. The additional not found part was a major hint. It wasn't finding the entry for the linked service, for the interactive mode used during debugging in the portal.

Solution

Anyway, the solution was to configure the connection string as DSN=@{linkedService().ServiceDSN} instead. That tests just the same, but doesn't throw the internal server error when you try to apply it. And it works as expected.

No idea why the way you construct the connection string matters, as they both return the same string, but it does. Both methods test fine, but one explodes when you try to apply it.
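
For reference, here's a minimal sketch of a parameterised ODBC linked service with the working form of the connection string (the names, authentication type, and integration runtime reference are made up, and only the relevant parts are shown):

{
    "name": "OdbcViaDsn",
    "properties": {
        "type": "Odbc",
        "parameters": {
            "ServiceDSN": {
                "type": "String"
            }
        },
        "typeProperties": {
            "connectionString": "DSN=@{linkedService().ServiceDSN}",
            "authenticationType": "Anonymous"
        },
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}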

Another Related Error

One other thing I saw periodically during testing was an error that said:

Format of the initialization string does not conform to specification starting at index 0

This error occurs if the connection string contains just the DSN name, without the DSN= prefix.

Hope any/all of these help someone else.

ADF: Use MSIs not SQL Logins whenever possible

Azure Data Factory (ADF) is great for moving data around. It often needs to connect to SQL databases. If you're creating linked services for this, please try to avoid using SQL logins. Don't use usernames and passwords for authentication if you can avoid them.
 
Azure SQL Database lets you create a user from a data factory's managed service identity (MSI). MSIs are a special type of Service Principal (SP). They provide an identity where you don't need to manage the ID or any sort of password. Azure does that for you.
 
Each ADF exposes an Azure-based identity on its property page. Here is the main menu for my BeanPerfection data factory:

Data factory properties page

You might have guessed you should go to the Managed identities menu option, but that is for User Managed identities. I'll write more about them another day. The identity we're interested in is a System Managed Identity.

On this properties page, it's actually called the Managed Identity Application ID. Like all other types of Service Principal, Managed Service Identities have an Application ID.

Creating a user in Azure SQL Database

I create a user in an Azure SQL Database with the FROM EXTERNAL PROVIDER option. This says that Azure SQL isn't performing the authentication.

The user name is the full name of the ADF. In Azure SQL Database, we can just use the name of the ADF instead of the identity. It will do that for us.
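
A sketch of what those statements look like, using the data factory and role names mentioned in this post:

-- assumes the datafactory_access role already exists in the database
CREATE USER [beanperfectiondatafactory] FROM EXTERNAL PROVIDER;
ALTER ROLE datafactory_access ADD MEMBER [beanperfectiondatafactory];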

I always make this data factory name lower-case. Once the user is created, I add it to whatever role in the database makes sense. In my example above, I used the role name datafactory_access but there's nothing special about that name.

You need to decide which role to add it to based upon the tasks that the data factory needs to perform. While it's temptingly easy to just add it to db_owner, try to resist that. If in doubt, create a role that's used for data factory access and grant it permissions as they are needed.

Must execute using an AAD-Based Connection

If you just try to execute the statements above, you might find that you get an error message saying:

Msg 33159, Level 16, State 1, Line 8
Principal 'beanperfectiondatafactory' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.

You cannot create any type of Azure AD-based user in an Azure SQL Database if your connection was authenticated as a SQL login. You must use a connection that was itself made using Azure AD authentication.

I find that the easiest way to do that is to make sure I have an Azure Active Directory admin assigned for my Azure SQL server, and then just execute the code right in the Azure portal. I use the Query Editor tab in the main menu for the Azure SQL Database and connect as that administrator.

 

ADF: Where did "discard all changes" go in Azure Data Factory?

I'm a big fan of Azure Data Factory (ADF), but one of the things you need to get used to with tools like this is that the UI keeps changing over time. That makes it hard for a couple of reasons:

  • It's hard to train people. Within a fairly short period of time, any recorded demo you have will show them things that no longer exist.
  • Every time a cosmetic change occurs, it immediately devalues blog posts, tutorials, etc. that are out on the Internet.

I think Microsoft don't quite get how much blog posts, etc. supplement their own documentation.

It's the same with Power BI. I used to often teach a Power BI class on a Tuesday, as the second day of our week-long BI Core Skills class. And Tuesday was the day that Power BI changes would appear. So I'd be showing a class how to do something, and suddenly I'm trying to find where a tool I use regularly went.

So even people who work with these tools all the time, keep having odd moments where they go to click something they've used for ages, and it's just not there any more.

I had one of those moments the other day in ADF. I went to click on the menu item for "Discard all changes" and it had vanished. My mistake was that I kept looking through the menus and wondering where it went. I didn't notice that it had become a trash-can icon in the top-right of the screen.

So this is just a post for anyone else who's wondering where it went. (Or for someone following a tutorial or blog post and can't find the menu item).

SDU Podcast #82 with guest Kamil Nowinski now released

I recorded another podcast with an old friend recently. I met Kamil Nowinski in Poland when I was invited to speak at SQL Day. That event is on again this year, but as a virtual event only.

In our consulting work, we use Azure Data Factory frequently. Deploying it can be a real challenge. Microsoft has a supplied method based on ARM templates, but that has issues. Kamil produced free tooling that's in the Azure Marketplace and makes it much easier to work with.

In this podcast, Kamil describes the overall process, the challenges that arise with the Microsoft-supplied mechanism, and the way his tooling makes it easier.

If you want to understand more about Azure Data Factory (ADF) and how it interacts with source control and deployment, this is a great show to listen to.

You'll find the podcast with Kamil (and all our other podcasts) here: https://podcast.sqldownunder.com

ADF: Keep linked service names the same across environments for Azure Data Factory

I recently recorded a podcast with Kamil Nowinski. Kamil is well-known for creating some tools that make it easy to publish individual Azure Data Factory pipelines. It's a far better method than the standard ARM (Azure Resource Manager) template deployment that's provided by Microsoft.

However, if you use this method (and I recommend you do), you'll be publishing pipelines separately from the linked services that your datasets connect to.

One mistake I've seen a few clients make lately is creating the linked services in different environments with slightly different names. Because of that, they needed to modify the deployed objects after deployment. You don't want to be doing that.

Data Factory datasets use the name of the linked service to identify it, so when you're creating linked services, it's really important to keep their names consistent across the different environments where you'll be deploying pipelines and datasets.

Even though the linked services will be pointing to different servers, etc. in different environments, keeping the names of the linked services consistent will make your life much easier.