Fabric Down Under show 8 with guest Heidi Hasting now available!

Another Fabric Down Under podcast is out the door. This time it was with another fellow "Down Under" guest, Heidi Hasting from Adelaide.

Heidi Hasting is a Business Intelligence professional and former software developer with over ten years' experience in Microsoft products. She's an MVP, an MCT, and the holder of many certifications, as well as being an ALM/DLM enthusiast, an Azure DevOps fan, and co-founder and organiser of the Adelaide Power BI User Group.

Heidi is a regular attendee at tech events including Azure Bootcamps, DevOps days, SQLSaturdays, Difinity and PASS Summit, and I've seen her so many times at events around the world.

This show was a bunch of fun to record, and I think you'll hear that shine through in the recording. We discuss a wide range of Fabric-related topics, but primarily look at the current situation with source control and DevOps in relation to Microsoft Fabric and related tools like ADF and Power BI.

You'll find this show, along with the previous shows, at: https://podcast.fabricdownunder.com

SSIS: Reading pipe-delimited text and selecting particular output columns

There was a question on the Q&A forums today, asking how to read data using SSIS when it's in this format:

One of the concerns was that there was a leading pipe. This is not a problem. When you have data like that and you set | as the delimiter, there are 6 delimiters, so there are 7 columns output. These are the values:

Column 1: blank string
Column 2: 101
Column 3: blank string
Column 4: A
Column 5: 21
Column 6: DC
Column 7: blank string

SSIS can handle that easily. Let me show you how.
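If you'd like to see the same splitting behaviour outside SSIS, here's a minimal Python sketch. The sample line is a hypothetical reconstruction based on the column values listed above, not the original file:

import csv
import io

# Hypothetical sample line reconstructed from the column values above:
# a leading pipe and 6 delimiters in total, so 7 columns come out.
sample_line = "|101||A|21|DC|"

# Parse it with | as the column delimiter.
reader = csv.reader(io.StringIO(sample_line), delimiter="|")
row = next(reader)

print(len(row))  # 7
for position, value in enumerate(row, start=1):
    print(f"Column {position}: {value!r}")  # blanks for columns 1, 3 and 7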

Also, if you'd like to learn about SSIS from me, we have a great online on-demand course here:

SQL Server Integration Services for Developers and DBAs (sqldownunder.com)

Example SSIS project

I start by creating a new SSIS project. Next, in the Control Flow tab, I drag in a Data Flow Task and rename it to Import the Rows.

Control Flow

Right-click the task and choose the Edit option. (Note: you can also do this via the Advanced Edit button, but there's no need for that here.)

Edit link

When the data flow task editor appears, drag in a Flat File Source. (Note: make sure it's this one, not the useless Source Assistant at the top)

Flat file source

Rename the flat file source to SampleData, then right-click it and choose Edit. The Flat File Source Editor will appear:

Flat File Source Editor

You need to add a new connection, so choose New:

Add source details

My SampleData.txt file contains this:

So here I've added the file name (note: in production code, I would parameterize this), and I've made Vertical Bar the Header row delimiter. I've ensured that Column names in the first data row is selected, and I don't want to skip any header rows.

Once I've done that, I go to the Columns page:

Columns page

Note that SSIS has already named the four columns that had names in the header row, and used generic names for the others. At this point, it's important to check the OutputColumnWidth property for each column. Then I click OK to save it.

Back on the Flat File Source Editor page, go to the Columns page:

Columns page

Note here that I can then uncheck the columns that I don't want, at the top:

Uncheck columns

Once I click OK, the source is configured as required.

Testing

The easiest way to test a source is to connect it to something, and add a viewer to the data path.

I find the easiest option for this is a Multicast. You can connect to it, and it doesn't care if there are no destinations connected (unlike real destinations).

Multicast

So I drag a Multicast onto the surface and connect the data path to it. Then I right-click the data path (the line between them) and choose Enable Data Viewer.

Enable data viewer

Then click Start and you'll see the data as expected:

Expected data


We have four output columns with the correct names as expected.

I hope that helps someone get started with this.


Book Review: Azure Data Factory Cookbook (Second Edition)

A few weeks ago, I received another book from our friends at Packt. It was the second edition of Azure Data Factory Cookbook by Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova, and Xenia Ireton.

I liked the earlier version of this book, and I was pleased to see substantial work had gone into this second edition.

It's interesting that the book doesn't stick directly to Azure Data Factory (ADF) but also branches into topics on Azure Synapse Analytics, Microsoft Fabric, and Databricks.

Cookbook books

I'm a fan of cookbook style books. In many cases, readers already understand the basics, and instead of a detailed coverage of all the material from a learning perspective, what they need is a cookbook that says "if you're trying to do this, try this". That's a very different approach.

I first saw it used widely in the books by Art Tennick. He had cookbook style offerings for MDX and DAX that I really enjoyed. I used to say that I had a lot of friends who dabbled in MDX but I think I only knew about 4 people who really "thought" in MDX. And rather than learning MDX from scratch, as a standalone concept and in depth, people often just needed examples of how to achieve particular outcomes. It is often the same for DAX today.

Throughout this book, there are "Getting Ready", "How to do it", "How it works", and sometimes "There's more" sections for each of the things you try to do. I liked that structure.

Production

Overall, the book was good quality.

Like many current books, it struggles with the size of images. The tools don't lend themselves to narrow portrait-style screenshots, so it's hard for the authors to deal with that. This screenshot isn't too bad (others are much harder to read), but even a shot like this would be difficult for many readers:

ADF cookbook small writing

The other thing that I'm not a fan of with screenshots is the use of dark mode themes. I don't think they work well in books:

Dark mode text

Content

It's a solid read, as it's around 500 pages. It provides very clear detail on how to load and transform data using ADF. I was pleased to see some coverage of custom activities.

Unlike with most books, there was nothing that I really objected to in how the code was created.

The only criticism that I'd make is that I was hoping for more depth on using ADF, given the title of the book. While it's pleasing to see the breadth of coverage of other topics in the book, with everything from AI services, to machine learning, to logic apps, and functions, I would have liked to have seen more content that directly matched the title of the book.

I use ADF daily and there are so many things I've learned about how to really use it and how to get things done that aren't mentioned. ADF is one of those tools where some days I can't believe how much I've been able to achieve, and other days I want to throw my mouse through the screen.

I would like to have seen more concentration on how to get around things that are frustrating. That would particularly help newcomers.

There are good examples of how to use ADF's expression syntax spread throughout the book. I suspect it's an area that really could warrant a chapter of its own.

Summary

Overall, I really liked the book. I would like to have seen more depth on ADF (to match the book title) at the expense of removing some breadth.

7 out of 10


Book Review: Data Cleaning with Power BI

I was excited to see Gus Frazer's new book on Data Cleaning with Power BI. Our friends at Packt sent a copy for me to take a look at.

Gus has a background with both Power BI and Tableau, and it's always interesting to see a mix of perspectives in any book. In this book, he shows a variety of data cleaning/cleansing tasks, then how to do many of them with M in Power Query. And you can tell Gus has a teaching background, because chapters have review questions. The book could well support a course.

Content I found interesting

In later chapters, he gets into fuzzy matching, fill down, and using R and Python scripts, along with how to use ML to clean data. Gus has added quite good coverage of how to create custom functions in M, and importantly, how to optimize queries in M.

The only comment that I think was missing is that I try to avoid doing this work in M if I can. I'm a fan of pushing this work upstream wherever possible. Given the topic of the book, I know that's a tall order, but it needs to be mentioned. Power Query and M aren't the strongest contenders when you're dealing with large amounts of data that need to be manipulated. He did have a section on dealing with Big Data, and that's where I'd most collide with using Power Query and M. The bigger the data, the less likely I'd be to want to use these tools to do the work.

The book provides some info on data modeling, and why things like bidirectional cross-filtering are such an issue. It was good to see that, given how much of my life I spend removing it from models built by clients.

I liked the coverage of calculation groups.

There was a section on preparing data for paginated reports, and then one on cleaning data using Power Automate. This is another area where I'd differ. The problem with Power Automate is the licensing model. I see far too many solutions built using it that break when the author is no longer with the company. For anything that the organization depends upon, I'd be far more likely to use Logic Apps running under a service account than Power Automate.

I really liked seeing a section on how OpenAI can make this work easier.

Things I'd like to see

This is a good book but there are some things I'd like to see.

Screenshot images

The book is of pretty good quality. What I did find hard was that many screenshots were simply too small. I know it's hard to squash the Power BI UI into a portrait-oriented book, but it needs further thought. Here's an example of what I'm talking about:

Tiny writing in the advanced editor

That's probably OK in a PDF version of an eBook where you can zoom in, but I struggle to imagine how that would appear in a printed book. I know it's not easy, and it's an issue I've run into myself when writing.

Calculated columns

While it's important that calculated columns are shown, I really try to avoid them in these models, and wherever possible, I like to see that work "pushed left", i.e., back to the source.

Data types

Some of the biggest issues with data transformation and cleansing relate to data types. Power Query and M are fairly poor at working with various data types. In fact, it's a criticism I have of Power BI as a whole. For example, whoever decided that "decimal" should really mean "float", so we end up with columns of numbers that don't always add up, simply made a poor decision in the design of the product. These types of things, and other aspects of data types, really need much stronger coverage in the book.
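To show the kind of problem I mean, here's a small Python sketch of how binary floating-point totals drift. It's a general illustration of the "decimal really means float" issue, not Power BI code:

from decimal import Decimal

# Summing 0.1 ten times in binary floating point ("float") doesn't give
# exactly 1.0, which is why report totals can look slightly "off".
float_total = sum([0.1] * 10)
print(float_total)          # 0.9999999999999999
print(float_total == 1.0)   # False

# A true decimal type keeps the arithmetic exact.
decimal_total = sum([Decimal("0.1")] * 10)
print(decimal_total)                  # 1.0
print(decimal_total == Decimal("1"))  # True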

Summary

A good book, and well-written.

7/10


Book Review: Extending Power BI with Python and R

I've seen a few books lately from the Packt people. The latest was the second edition of Extending Power BI with Python and R by Luca Zavarella, subtitled Perform advanced analysis using the power of analytical languages.

Author

Luca Zavarella

The author is Luca Zavarella. I've been working with Power BI since before it was released, and ever since I've seen discussions around using R (initially) and Python (later) with it, Luca has been one of those people that everyone listens to.

Luca is a fellow Microsoft MVP and has a great background in data science. In recent times, he seems to have been focussing on advanced analytics, data science, and AI. He's been prolific in these areas.

The Book

This is another very comprehensive book and took quite a while to read. It will be a vital reference for anyone trying to apply Python and/or R in Power BI. I've heard many argue that this is a pretty niche topic but I can't say I agree. I deal with a real mixture of clients and while not everyone does this, quite a few do. That of course particularly applies to anyone from a data science background.

In the book, Luca's experience shines through.

The real power of the book is the way that Luca shows you how to do things with Python and/or R that most people working in Power BI without them would think were impossible.

In the DBA world, people often talk about the "accidental DBA", i.e., someone who wasn't planning to be a DBA but ended up being the one doing the tasks. I can well imagine that if you are the "accidental data scientist" working with analytics, there's a great amount to learn from this book.

I could also imagine this book being useful to people who use other analytic products, not just Power BI, because Luca explains the theory, not just the practical application.

And even though I've worked with Power BI from the start, I found a few interesting things in the book about Power BI itself, not just about using Python and R with it. It's important to always keep learning, and every time I read something about a product that I think I already know well, I invariably learn something anyway. That's often because everyone uses the products differently, or applies them to problems that you hadn't considered.

Summary

Another great book. And another one where I can't imagine how long it must have taken to write. And the second edition adds great value.

It's not one for all my usual data audience, but anyone with an interest in applying advanced analytics in Power BI should check it out. You will learn a lot.

9 out of 10


Power BI Implementation Models for Enterprises Part 3: Cloud Friendly Clients

I've been writing a series on Power BI Implementation Models for Enterprises for https://tekkigurus.com.

Part 3, which covers what I consider Cloud Friendly Clients, is now published:

https://www.tekkigurus.com/power-bi-implementation-models-part-3-cloud-friendly-clients/

Enjoy!

SQL Down Under show 85 with guest Bob Duffy discussing building data warehouses is now published

And another new SQL Down Under show is published!

Once again, I had the great pleasure yesterday of recording a podcast with one of my UK-based friends, Bob Duffy.

Bob is a Principal Data Architect at Prodata's SQL Centre of Excellence in Dublin, Ireland. He helps enterprise customers build large scale enterprise data warehouses in Azure, typically integrating ERP and financial systems.

Bob is a long-term Data Platform MVP, a Microsoft Certified Architect, Master, and Analysis Services Maestro. He holds a number of Azure role-based certifications. Bob also specialises in performance tuning, architecture, load testing, DevOps, software engineering, MQ architecture, semantic modelling, and metadata-driven design on the Microsoft platform.

Bob used to work for Microsoft Consulting Services as a Senior Consultant and subject matter expert for BI/data in Western Europe in 2005.

In the show, we discuss the approaches he takes to building data warehouses on the Microsoft BI platforms.

I hope you enjoy it. You'll find this show (and previous shows) here: https://podcast.sqldownunder.com/

SQL: Understanding Change Data Capture for Azure SQL Database – Part 4 – Azure SQL Database Service Level Objectives

This is part 4 of a series on working with change data capture (CDC) in Azure SQL Database. This part discusses how Azure SQL Database service level objectives (SLOs) affect your ability to use CDC.

When I was reading the documentation for CDC in Azure SQL Database, I kept coming across a mention that it required at least an S3 SLO if you were using a DTU-based database.

I really hoped that wasn't the case.

I was left saying "please say it ain't so!".

Testing

I thought it was time to find out what actually happens when you try it.

I started by creating a database with the lowest SLO (Basic). Now I know these don't hold enough data to really be very meaningful, but I wanted to know what would happen.

I didn't get far.

I created the table from the scripts in part 3 of this series, and then tried to do the first enable of CDC.

To say that's disappointing is an understatement. But it makes it clear that Basic, S0, S1, and S2 aren't going to let you enable it.

What about Scaling?

Then I wondered what would happen if I enabled it on an S3, and then tried to scale down.

So I scaled the DB to S3 ok, and enabled CDC as in part 3. All worked as expected.

I then scaled the DB back to S2 using the Azure Portal. And alas, that failed too:

I suppose I'm not surprised, but I had hoped it might at least have automatically disabled CDC if it really had to.

It's a Mistake

I think these licensing restrictions for CDC in Azure SQL DB are a mistake.

Unhappy woman
Unsplash image from Getty

Back in SQL Server 2016 SP1, I was one of the main instigators in getting the product team to make sure that the programming surface of the database worked across all the editions (Enterprise, Standard, Developer, Express) wherever possible.

This restriction on CDC goes directly against that ethos. It harks back to large single-database thinking, not cloudy thinking.

I understand that it's probably needed for performance in production scenarios, but what I'm always trying to tell the team is that production scenarios aren't the only scenarios.

Developers also need to write code. We should let them write and test functionality. It can be clear to them that it won't work the same as on a production instance, but that's no different to many other aspects of the system. A Basic SLO database won't run a complex query over a lot of data either, but we don't expect it to.

But it's important to let them write their code and make sure it basically functions, no matter what level of SLO they're working with. If you require them all to have an expensive database, just to test code or concepts, you greatly limit how many developers will use the features. 

Clouds
Unsplash image by Dallas Reedy

Having those types of restrictions also limits the cloudiness of the solution, in that you can't scale up and down across a range of SLOs. We often take S7 or S8 databases and temporarily "park" them at S2 while they're not in use. This type of restriction kills those options as well, and makes the overall solution less cloudy.

And One More Part to This Series

That's service level objectives for the use of CDC in Azure SQL DB. We'll move to accessing CDC data from another Azure SQL DB next.

  1. Why use Change Data Capture for Azure SQL Database?
  2. How Change Data Capture works in Azure SQL Database
  3. Enabling and using Change Data Capture in Azure SQL Database
  4. Change Data Capture and Azure SQL Database Service Level Objectives
  5. Accessing Change Data Capture Data from Another Azure SQL Database


Azure Data Factory (ADF) – Issues with parsing strings in CSVs

It's scary how much of the world's data lives in CSVs. Yet, as a standard, it's problematic. So many implementations just don't work as expected. Today, I had a client asking why Azure Data Factory wouldn't read a CSV that, while odd, was in a valid format.

The simplest equivalent of the file that wouldn't load would be one like this:

There are meant to be four columns. The source system wrapped strings in quotes only when the string contained a comma, as that was the delimiter for the file.

But the fourth line in that file would not load. ADF would split it across multiple columns. It was not interpreting the double-double-quotes.

I was puzzled at first, as the file seemed OK. If I uploaded that file to any of the common online CSV validators or linters, they all said it was fine.

The client had configured the dataset as follows:

The options seemed ok to them.

Turns out the issue was with the escape character. ADF defaulted to using a backslash. For the majority of CSV files that I see, it would need to be a double-quote:

And sure enough, with that change in place, it loaded files similar to the one above without issue.
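If you'd like to see why the escape character matters, here's a small Python sketch that mimics the two behaviours. The sample row is hypothetical (the client's actual file isn't shown here), but it has the same shape: four columns, with one quoted value containing both a comma and a doubled double-quote:

import csv
import io

# Hypothetical row shaped like the problem line: a quoted field that
# contains a comma and an embedded "" (an escaped double-quote).
line = '4,Smith,"He said ""hello"", then left",QLD'

# Treating "" as an escaped quote parses the expected 4 columns.
good = next(csv.reader(io.StringIO(line), quotechar='"', doublequote=True))
print(good)    # ['4', 'Smith', 'He said "hello", then left', 'QLD']

# Relying on a backslash escape instead (loosely mimicking ADF's default)
# breaks the row apart, because the "" is no longer understood.
bad = next(csv.reader(io.StringIO(line), quotechar='"',
                      doublequote=False, escapechar="\\"))
print(bad)     # 5 values, with stray quotes leaking into the data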

I don't know why that's not the default for CSV files in ADF but I do hope it helps someone else.

Fabric Down Under show 3 with guest Mark Pryce-Maher discussing SQL Warehousing in Microsoft Fabric

I had the great pleasure of spending time today talking with Mark Pryce-Maher. We were discussing the SQL warehousing experience in Microsoft Fabric.

Mark is a Senior Program Manager at Microsoft, where he's working closely with the Microsoft Fabric team. He has a strong background in many areas of data, most recently (prior to Fabric) helping customers migrate to Azure Synapse Analytics. He has worked extensively with the dedicated SQL pools in that service.

SQL is one of the core experiences for data engineers in Microsoft Fabric. In an upcoming show, I'll discuss the other core experience with Spark notebooks.

In the show, Mark provides a solid introduction to what's available, and we dive into many areas of how it works and how you might use it.

You'll find the show here: https://fabricdownunder.com

I hope you find it useful.