Announcing the Fabric Down Under podcast – Show 1 available now

I've mentioned in a previous post how important I think Microsoft Fabric is going to be.

So, in addition to our existing SQL Down Under and Cosmos Down Under podcasts, we have a brand new podcast called Fabric Down Under.

In the preview for Microsoft Fabric, one of the key participants from the product group has been Pawel Potasinski. I've known Pawel for many years and he's been the driving force behind many important things. It was great to see him driving community around Microsoft Fabric in preview.

So he was the obvious choice for the first podcast in the series.

I hope you enjoy it. You'll find all the shows and details on subscribing at our site: https://fabricdownunder.com

Welcome Microsoft Fabric – Most significant change in Microsoft BI

Microsoft has had a strong lead in BI and analytics for a long time now. The introduction of the tabular data models in 2012 was a watershed moment.

Today, they have announced the public preview of Microsoft Fabric. It's a bigger deal than the tabular model was, and it will set the direction of BI tooling for a long time to come.

The opposition was already struggling to keep up, and in fact hadn't managed to do so. Microsoft Fabric raises the bar much, much higher again.

What is Fabric?

First up, it's a licensing model. Similar to the way you can license Office as a bundle rather than as individual products, it's a product that bundles a set of best-of-breed tools.

But it's not just a licensing model.

Microsoft Fabric delivers a single, integrated environment for working with analytics, with deep integration between the tools, which are known as experiences. At release, these are the experiences:

Data Integration – will be very familiar to anyone who's worked with Azure Data Factory

Synapse Data Engineering – if you love running Spark or Python notebooks and more, this will make you feel happy

Synapse Data Warehousing – this covers both data warehouses and lakehouses, and provides an interface that's familiar to SQL devs.

Synapse Data Science – want to create and run machine learning models across the data? This is for you.

Synapse Real Time Analytics – need to deal with telemetry, logs, etc. and query them for analytics at volume? This is the tooling you need.

Power BI – is enhanced to work with the new storage mechanism. As well as the previous Import and DirectQuery modes, there is now a Direct Lake mode.

And I'd expect to see more experiences coming.

OneLake

One of the most critical aspects of Microsoft Fabric is that data for all the experiences is stored in OneLake. This is basically Azure Data Lake style storage that holds Delta-Parquet files, but again, it's much more than just a storage account.

Having this single copy of the data in a single format means you can access the same data from within all the different experiences. I can't stress enough how important this is.

And unlike what happens with competitors, the file format is a standard open format: Delta-Parquet.

Get Involved !

Importantly, you can now try this too. You'll find free trials as part of the public preview at https://aka.ms/try-fabric. I'd encourage you to get involved.

For a detailed introduction, watch the on-demand session from Microsoft Build: https://build.microsoft.com/en-US/sessions/852ccf38-b07d-4ddc-a9fe-2e57bdaeb613?source=sessions

And make sure to subscribe to our new Fabric Down Under podcast!

Cosmos Down Under podcast 8 with guest Blaize Stewart is now published!

I recently noticed a blog post from Jay Gordon that was calling out interesting content from Blaize Stewart on antipatterns for development with Azure Cosmos DB.

I thought this would be a perfect topic for one of our Cosmos Down Under podcasts and Blaize agreed to take part.

So much content talks about what to do with a product, and even though everyone knows there are things you shouldn't do, very little content talks about these antipatterns. It was great to speak to Blaize about it, and to get his updated thoughts.

You'll find this (and all) shows here: https://podcast.cosmosdownunder.com

ADF: Passing a file attachment via a POST to a REST service using Azure Data Factory

I was recently doing work for a client where I needed to replace some Python code that was being used to upload a file attachment to a REST-based service using a POST command. I was trying to do it using a Web Activity in Azure Data Factory. What was surprising was how little information I could find on how to do that.

While I can't post all the code, I did want to post the core parts here in case it helps others, and for my own future reference. The basic steps involve:

  • Put the file into a storage account
  • Read the file into a variable
  • Construct the body of the POST call, including the attachment
  • Send the POST call using a Web Activity
  • Test the REST-based call

Put the file into a storage account

Ideally, the Web Activity would have a way to just point to a file in an Azure Storage Account and get it to attach that file to calls. It doesn't have that option.

What it does have is an option to include a dataset in the body of the call.

Add dataset UI

I was immediately hopeful and tried that, and while it was interesting, it didn't include the data as an attachment. It just appended it into the body of the call.

So let's load up a file into a variable and use the variable to provide part of the body of the web call.

First, I created a resource group called rest_post_test using the Azure portal, then created a storage account called greglowfileupload. I created a text file called cities.csv with the following contents:

List of cities

And I uploaded it to a container called uploadfiles.

Cities file in a storage container

Read the file into a variable

I created an Azure pipeline called RestUploadTest, and added the following parameters:

Pipeline parameters

The APIKey was provided by the REST service and is used to log on, instead of using other forms of authentication.

The SiteURL is the address that the POST is being sent to. I've used a test site that I'll talk about in a moment.

The ContainerName, FolderName, and FileName parameters give the location of the file in the storage account. (In this case, the file isn't in a subfolder.)

All of these values would normally be read from a database, but for simplicity, I've made them parameters to the pipeline.

I then added the following variables:

Pipeline variables

Next I added a linked service StorageFile to connect to the storage account.

Linked service definition

Note that I would normally use a Key Vault for holding key details for the storage account, but I've gone for simplicity in this demo.

To read the file from storage, we can use a Lookup Activity, but it requires a Dataset, so we create the dataset first. I called it StorageTextFile. I like to make datasets reusable, so I added parameters to it:

Dataset parameters

Then I configured those parameters as dynamic values in the Connection tab:

Dataset connection tab

Dataset connection tab 2
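Assuming the dataset parameters mirror the pipeline parameters (ContainerName, FolderName, FileName), the file path fields on the Connection tab would reference them along these lines:

Container: @dataset().ContainerName
Directory: @dataset().FolderName
File name: @dataset().FileName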

Note that because I used a Delimited Text File as the dataset type, I had to configure a column and row delimiter. I chose values that will not occur in my file. I used the preview option to make sure I could read the data.

In the pipeline, I added a Lookup Activity called Get File Contents, configured the timeout as 20 minutes (not the default of 12 hours !) and set the details as follows:

Get file contents activity

The dataset properties were populated dynamically from the pipeline parameters.
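In other words, each dataset parameter is set from the matching pipeline parameter, something like:

ContainerName: @pipeline().parameters.ContainerName
FolderName: @pipeline().parameters.FolderName
FileName: @pipeline().parameters.FileName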

Finally, I added a Set Variable Activity called Set AttachmentFileContents.

Set attachment file contents

The dynamic code for setting the variable was:
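It follows this pattern (using the Get File Contents activity and the Prop_0 column described below):

@activity('Get File Contents').output.firstRow.Prop_0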

I used the output of the Get File Contents activity and took the first row from it. Because I didn't use the first row as a header row, ADF allocated its own column headers. In this case, there is one column called Prop_0.

I then tested the pipeline up to this point. I clicked on the output of the Set AttachmentFileContents activity and could see the values. (They're JSON-encoded in this viewer.)

First test ok

All good to here. We have the file loaded into a variable.

Construct the body of the POST call, including the attachment

The next two things I need to configure are the variables that hold the content type and body for the web request. I put them in separate variables to make debugging easier.

First, I configured the web request content type variable:

Web request content type

The value was actually:
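It takes this general form, where the boundary string is arbitrary (the one shown here is just an illustrative placeholder):

multipart/form-data; boundary=WebAppBoundary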

This is multipart form data, and because the body can contain many values, you need a boundary that's used to separate them.

Then the body values:

Web request body

The value was composed of:
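A sketch of the expression is below. The boundary placeholder, the form field name (file), and the content type are assumptions and need to match whatever the target service expects; the variable name follows the Set AttachmentFileContents activity above:

@concat('--WebAppBoundary
Content-Disposition: form-data; name="file"; filename="', pipeline().parameters.FileName, '"
Content-Type: text/plain

', variables('AttachmentFileContents'), '
--WebAppBoundary--')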

Note: the positioning of the carriage returns/line feeds in this is critical. You can't just concatenate them by using '\r' or '\n'; I've found that doesn't work. The line breaks need to be embedded directly in the dynamic values.

Also note: you'll see the boundary value in here. Where it first appears, it has a double-dash prefix, and at the end, it has both a double-dash prefix and suffix.

Now we have the pieces ready for the Web Activity.

Note on Line Endings

I found this works for most web servers, but some web servers are fussy about how new lines are formed. Some are happy to just get a line feed (LF) character, others want a carriage return and line feed (CRLF).

If you only supply an LF and the server wants a CRLF, the error returned is normally one that says you've reached the end of the stream unexpectedly (or similar words). In that case, you'll need to adjust the body to have consistent CRLF line endings.

The replace() function in ADF works fine, but you can't just specify '\n' or '\r', etc. I've found you need to use the base64 encodings to get what's needed (a suggestion from fellow MVP Simon Sabin), as in the example after the list below.

They are as follows:

  • Instead of '\r' write base64ToString('DQ==')
  • Instead of '\n' write base64ToString('Cg==')
  • Instead of '\r\n' write base64ToString('DQo=')
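For example, to normalize a hypothetical RequestBody variable so that every line ending becomes CRLF, one approach is to first collapse any existing CRLF pairs to LF, then expand every LF back to CRLF:

@replace(replace(variables('RequestBody'), base64ToString('DQo='), base64ToString('Cg==')), base64ToString('Cg=='), base64ToString('DQo='))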

Send the POST call using a Web Activity

I then added a Web Activity called Upload the file attachment:

Web activity

Note the Authorization header contained:
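It follows this pattern, using the APIKey pipeline parameter defined earlier:

Bearer @{pipeline().parameters.APIKey}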

That supplies the API key as the bearer.

Note also: you shouldn't need the Accept header, but I've found some services complain if it isn't there. The other reason for setting it to application/json is that when an error occurs, you can read the response without having to decode it, as you would with gzip compression, etc.

Uploading Larger Files

If you are uploading larger files, there are two things to consider:

  • I'm not sure what the current maximum size is for an ADF variable, but the method that I've used assumes that the whole file will fit in a variable. It's pretty large though.
  • In the advanced settings for the Web Activity, there is a setting for an HTTP Request Timeout. That defaults to 1 minute. If the upload will take longer than that, you'll need to increase it. It has a 10 minute maximum value.

HTTP request timeout

Test the REST-based call

Now we get to test it. What we need is something to catch the requests while we're testing. The easiest site I've found for this is webhook.site. You can set up a site that listens on a URL for free. It's awesome. When you start it, it gives you a temporary URL to use:

That's the URL that I used earlier as a pipeline parameter. Let's test the POST by debugging the pipeline.

First, we make sure the pipeline ran as expected:

When we check the website, we can see the request was received, complete with the file attachment!

Post received

Lower-level debugging

While this site is great for debugging most things, at first I struggled with it. The problem was that when I had malformed requests, I wished it would just show me the raw data it was receiving, so I could compare my requests to other well-formed requests. I asked the authors, and it can't do that. (At least not right now.)

What did help is that I found a GitHub project from Janne Mattila, called Echo. You'll find it here: https://github.com/JanneMattila/Echo. It looks like Janne also has a hosted version of it available.

I decided to build that project and deploy it to an Azure Website, in case I wanted to modify it. And it was so very helpful. It just shows you the detail of everything it receives, rather than formatting the request:

Echo output

Thanks Janne !

I know this has been a longer blog post than normal but I hope it helps someone work this out.

 

MS Tech Summit 2023 – Discounts running out

We have so many great tech friends in Poland. I was pleased to see they are running the MS Tech Summit, and pleased to be speaking at it.

Looks like a great set of sessions: https://mstechsummit.pl/en/. They tell me there are more than 100 sessions spread over 8 tracks.

Not in Poland? Not a problem. It's running both in-person and online, with video on demand.

The early bird discount is still available but about to end.

I'd love to speak with you at the summit.

ADF: Replacing carriage returns in Data Factory expression language

This one had me stumped recently.

I had a variable in ADF where the value contained carriage returns. I needed to remove them.

I guessed that would be easy:

@replace(variables('fileContent'), '\r', '')

But no matter what variation of this I tried, it wouldn't remove them.

Thanks to the genius of fellow MVP Simon Sabin, the answer was:

@replace(variables('fileContent'), base64ToString('DQ=='), '')

Note: if you need a line feed (LF), it's base64ToString('Cg=='), and a carriage return/line feed pair (CRLF) is base64ToString('DQo=').

Thanks Simon, and I hope this helps someone else.


FIX: Data Factory ODBC linked service fails to Apply and returns Internal Server Error

I was working with a client who was having trouble debugging an ADF pipeline, related to an ODBC linked service that wasn't working as expected.

The user had configured the connection string property of an ODBC connection this way:

  • Added a parameter to the linked service called ServiceDSN
  • Configured the connection string as @concat('DSN=',linkedService().ServiceDSN)

The Test Connection for that worked fine, but clicking Apply failed with an Internal Server Error. Specifically, the error was:

Failed to encrypt linked service credentials on linked self-hosted IR 'XXX-XXXX01' through service bus, reason is: InternalServerError, error message is: Internal Server Error

Absolutely no idea why. Has to be a bug.

Other Symptoms

What it had done, though, was leave the linked service connection details in a weird state. Trying to access it via the Test Connection option at the dataset level showed Data source name too long.

What finally gave me a clue is that when I looked at the error message in the log on the shared integration runtime, it actually said Data source name too long not found in mapping in connector. Apart from the lousy English in that message, it's interesting that in the UI, only the first part of it surfaced. The additional not found part was a major hint: the runtime wasn't finding the entry for the linked service, for the interactive mode used during debugging in the portal.

Solution

Anyway, the solution was to configure the connection string as DSN=@{linkedService().ServiceDSN} instead. That tests just the same, but doesn't throw the internal server error when you try to apply it. And it works as expected.

No idea why the way you construct the connection string matters, as they both return the same string, but it does. Both methods test fine, but one explodes when you try to apply it.

Another Related Error

One other thing I saw periodically during testing was an error that said:

Format of the initialization string does not conform to specification starting at index 0

This error occurs if the connection string contains just the DSN name, without the DSN= prefix.

Hope any/all of these help someone else.

Cosmos Down Under podcast 7 with guest Rodrigo Souza is now published!

I was able to record another Cosmos Down Under podcast today. My guest was Microsoft Senior Program Manager Rodrigo Souza.

In the show, we discussed the Change Data Capture feed for the Analytical store in Azure Cosmos DB. This is a powerful new capability and worth learning about.

I hope you enjoy the show.

https://podcast.cosmosdownunder.com/

 

Book Review – Make Your Data Speak – Alex Kolokolov

Over the last year, I've come to know Alex Kolokolov better, through involvement with his data visualization challenges. I was really pleased to see he'd written his first book, Make Your Data Speak (Creating Actionable Data through Excel for Non-Technical Professionals).

Things I Liked

I really liked the conversational style of the book. It's all structured around an approach of "let's see how this happens by example". The tone was really refreshing and should hold people's interest.

I also really loved the approach of starting with a (believable) mess and cleaning it up. The level of the book would probably work best for the "not really comfortable with pivot tables yet" audience but I could see it being useful for people who've worked with these types of problems before, and who have probably made all the mistakes that Alex talked about in the book. Alex did say it's for non-technical professionals and I think that's spot on.

Having QR codes for linking to sample code was a simple but nice addition.

I found myself chuckling a bit when reading the colours and theming parts. When I see demonstrations of material like this, people often demonstrate truly awful colours, and I think "no-one would do that". Alex has chosen examples that are nasty but believable. I've seen worse in the field.

Similarly, the section on choosing visualizations was detailed and well-argued.

The book finishes with a section on improving data-driven culture in the organisation. That's a good way to end.

Things I'd Like to See Improved

Not much. It's great!

It's important to note that the book focusses on Excel. I did keep thinking about how I'd do it in Power BI instead, but there is a very, very big audience for Alex's take on how to tell data stories using Excel.

I was concerned about how the English would read, coming from a non-native speaker. Some of it was a little unusual, but Alex has done an outstanding job. In fact, it's so much better than many books I've recently read from native English speakers. I really didn't notice errors, apart from a few things that sounded grammatically odd to me. As an example, the chapter "Dashboard Assembling" really should be "Dashboard Assembly" or perhaps better, "Assembling Dashboards". A sentence like "Assembly according to the layout is faster and easier…" is strictly correct but sounds a little odd.

I'd have to say though that nothing really grated on me. And that's quite an achievement. I hope one day I can do the same in Mandarin, but I fear I'll be far short of the level of what Alex has achieved in English writing.

The Verdict?

If you need to learn to tell a story with data and using Excel, this would be a worthwhile addition to your library.

7 out of 10

 

SQL: Understanding Change Data Capture for Azure SQL Database – Part 2 – How does it work?

In part 1 of this series, I discussed the positioning of Change Data Capture. In part 2, I want to cover how it works.

Log Reading

There are many ways that you can output details of changes that occur in data within SQL Server. Many of those methods require actions to occur at the time the data change is made. This can be problematic.

The first problem with this is the performance impact on the application that's making the change. If I update a row in a table, and part of the process writes details of that change to some type of audit or tracking log, I've now increased the work that needs to happen in the context of the application that's making the change. Generally, what this means is that I've slowed the application down by at least doubling the work that needs to be performed. That might not be well-received.

The second potential problem is even nastier. What if the change tracking part of the work fails even though the original update works? If I've done the change tracking work in the context of the original update (particularly if it's done as part of an atomic process), by adding tracking I might have broken the original application. That certainly wouldn't be well-received.

So what to do?

The best answer seems to be to work with the transaction log that's already built into SQL Server. By default, it holds details of the changes that have been occurring to the data. It can be read asynchronously, so delays in reading it mostly won't affect the original data changes at all (there are only rare exceptions to this). If reading the log fails, the problem can be corrected and the reading restarted, all without affecting the main updates that are occurring.

And that's what Change Data Capture does. It uses the same log reader agent that has been part of SQL Server for a very long time. Previously though, it was used for Transactional Replication. In fact if you use both Transactional Replication and Change Data Capture on the same SQL Server system, they share the same instance of the log reader. The SQL Server Agent is used to make them both work.

SQL Server Agent – Isn't that missing?

When we're working with Azure SQL Database, things are a bit different. Currently, we don't have any concept of Transactional Replication. That could change but right now, it's not there. So sharing the log reader isn't an issue.

But I also mentioned that with SQL Server, it was the SQL Server Agent that kicks off the log reading agent. And with Azure SQL Database, we don't have SQL Server Agent either !

The Azure SQL Database team have instead provided a scheduler that runs the log reader (called the capture process), and also runs the required clean-up tasks. SQL Server used another agent job to perform clean-up. This is all automated and requires no maintenance from the user.

Change Data Capture (CDC) Data Flow

The data flow with CDC is basically like the following:

CDC Data Flow

  1. The original app sends a change (insert, update, delete) to a table in the Azure SQL Database.
  2. The change is recorded in the transaction log.
  3. Some time later (usually not long though), the change is read by the capture process and stored in a change table.
  4. The data warehouse (DW), target system, or ETL process makes a call to a set of CDC functions to retrieve the changes.

Everything in the dotted box above is part of, and contained within, the Azure SQL Database.

Upcoming

In the next post in this series, I'll show you the code that's required, and the changes that occur in the Azure SQL Database when you enable CDC.

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database