SSIS: Reading pipe delimited text and selecting particular output columns

There was a question on the Q&A forums today, asking how to read data using SSIS when it's in this format:
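A sample row in that format (reconstructed here from the column values listed below) looks like this:

```
|101||A|21|DC|
```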

One of the concerns was the leading pipe. That's not a problem. With data like that, if you set | as the delimiter, the 6 delimiters produce 7 output columns. These are the values:

Column 1: blank string
Column 2: 101
Column 3: blank string
Column 4: A
Column 5: 21
Column 6: DC
Column 7: blank string

SSIS can handle that easily. Let me show you how.

Also, if you'd like to learn about SSIS from me, we have a great online on-demand course here:

SQL Server Integration Services for Developers and DBAs (sqldownunder.com)

Example SSIS project

I start by creating a new SSIS project. Next, in the Control Flow tab, I drag in a Data Flow task and rename it to Import the Rows.

Control Flow

Right-click the task and choose the Edit option. (Note: you can also do this via the Advanced Edit option, but there's no need for that here.)

Edit link

When the data flow task editor appears, drag in a Flat File Source. (Note: make sure it's this one, not the useless Source Assistant at the top)

Flat file source

Rename the flat file source to SampleData, then right-click it and choose Edit. The Flat File Source Editor will appear:

Flat File Source Editor

You need to add a new connection, so choose New:

Add source details

My SampleData.txt file contains this:
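The file contents aren't reproduced here, but based on the description that follows (a header row naming four of the columns, plus pipe-delimited data rows), it would look something like this; the column names shown are purely hypothetical:

```
|CustomerID||Grade|Age|State|
|101||A|21|DC|
```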

So here I've set the file name (note: in production code, I would parameterize this), and I've made Vertical Bar the header row delimiter. I've ensured that 'Column names in the first data row' is checked, and I don't want to skip any header rows.

Once I've done that, I go to the Columns page:

Columns page

Note that SSIS has already used the four column names from the header row, and assigned generic names to the others. At this point, it's important to check the OutputColumnWidth property for each column (you'll find it on the Advanced page). Then we click OK to save it.

Back on the Flat File Source Editor page, go to the Columns page:

Columns page

Note that here I can uncheck the columns I don't want, at the top:

Uncheck columns

Once I click OK, the source is configured as required.

Testing

The easiest way to test a source is to connect it to something, and add a viewer to the data path.

I find the easiest option for this is a Multicast. You can connect the source to it, and it doesn't care if there are no destinations (unlike real destinations).

Multicast

So I drag a Multicast onto the surface and connect the data path to it. Then, right-click the data path (the line between them) and choose Enable Data Viewer.

Enable data viewer

Then click Start and you'll see the data as expected:

Expected data


We have four output columns with the correct names as expected.

I hope that helps someone get started with this.


Book Review: Azure Data Factory Cookbook (Second Edition)

A few weeks ago, I received another book from our friends at Packt. It was the second edition of Azure Data Factory Cookbook by Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova, and Xenia Ireton.

I liked the earlier version of this book, and I was pleased to see substantial work had gone into this second edition.

It's interesting that the book doesn't stick directly to Azure Data Factory (ADF) but also branches into topics on Azure Synapse Analytics, Microsoft Fabric, and Databricks.

Cookbook-style books

I'm a fan of cookbook-style books. In many cases, readers already understand the basics, and instead of detailed coverage of all the material from a learning perspective, what they need is a cookbook that says "if you're trying to do this, try this". That's a very different approach.

I first saw the style used widely in the books by Art Tennick. He had cookbook-style offerings for MDX and DAX that I really enjoyed. I used to say that I had a lot of friends who dabbled in MDX, but I think I only knew about four people who really "thought" in MDX. Rather than learning MDX from scratch, as a standalone concept and in depth, people often just needed examples of how to achieve particular outcomes. It's often the same for DAX today.

Throughout this book, each recipe has "Getting Ready", "How to do it", and "How it works" sections, and sometimes a "There's more" section. I liked that structure.

Production

Overall, the book was good quality.

Like many current books, it struggles with the size of images. The tools don't lend themselves to narrow portrait-style screenshots, so it's hard for the authors to deal with that. This screenshot isn't too bad (others are much harder) but even a shot like this would be difficult for many readers:

ADF cookbook small writing

The other thing I'm not a fan of in screenshots is the use of dark mode themes. I don't think they work well in books:

Dark mode text

Content

It's a solid read at around 500 pages. It provides very clear detail on how to load and transform data using ADF. I was pleased to see some coverage of custom activities.

Unlike with most books, there was nothing I really objected to in how the code was written.

The only criticism I'd make is that I was hoping for more depth on using ADF itself, given the title of the book. While it's pleasing to see the breadth of other topics covered, with everything from AI services to machine learning, logic apps, and functions, I would have liked more content that directly matched the title.

I use ADF daily and there are so many things I've learned about how to really use it and how to get things done that aren't mentioned. ADF is one of those tools where some days I can't believe how much I've been able to achieve, and other days I want to throw my mouse through the screen.

I would like to have seen more concentration on how to get around things that are frustrating. That would particularly help newcomers.

There are good examples of how to use ADF's expression syntax spread throughout the book. I suspect it's an area that really could warrant a chapter of its own.

Summary

Overall, I really liked the book. I would like to have seen more depth on ADF (to match the book title) at the expense of removing some breadth.

7 out of 10


Book Review: SQL Query Design Patterns and Best Practices

Another book that I read recently was SQL Query Design Patterns and Best Practices by Steve Hughes, Dennis Neer, Dr Ram Babu Singh, Shabbir H Mala, Leslie Andrews, and Chi Zhang. It also came from our friends at Packt publishing.

Purpose

The intro to the book says "This book is for the SQL developer who is ready to take their query development skills to the next level. This includes report writers, data scientists, or similar data gatherers and allows users to expand their skills for complex querying and build more efficient and performant queries."

It's worth noting that we have a much more detailed online course covering these topics. You'll find it here:

Advanced T-SQL for Developers and DBAs (sqldownunder.com)

The book does cover a lot of good common sense regarding writing queries, like not returning columns or rows in queries when you don't need them.

There were many parts that I really liked. For example, the JSON chapter was pretty good. It was also good to see a chapter covering T-SQL notebooks in Azure Data Studio.

It was interesting to see a security chapter in the book, but I found it very light on content.

Multiple Authors and Editing

I get nervous whenever I see a book with a lot of authors. You know that unless both the general and tech editing are exemplary, you'll end up with a book that reads like it was written by a lot of people, not with a single thought process.

I used to find it amusing when Wrox published books with all the authors on the cover, like this:

Book with 17 authors

I think they thought it was something to be proud of, but I used to find an inverse relationship between the quality of a book and the number of pictures on its cover. When a book has 17 authors, I'll just pass thanks, unless the book is intended as a compendium, like our old MVP Deep Dives books.

So yes, I've seen much worse, but even with (just) six authors, this book does suffer from inconsistencies, particularly in coding style, that I don't think would have come from a single author. Just a few examples:

  • I struggled with the way they slipped in and out of quoted identifiers: sometimes using [] around identifiers, sometimes double quotes, and other times no quoting at all (see the sketch after this list).
  • There was completely inconsistent use of statement terminators.
  • There was a weird mix of capitalization and formatting in queries.
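As a quick illustration of that first point, here's the same (hypothetical) query written three ways; any one style is fine, but mixing them throughout a book is jarring:

```sql
SELECT [OrderID] FROM [dbo].[Orders];   -- bracketed identifiers
SELECT "OrderID" FROM "dbo"."Orders";   -- quoted identifiers (needs QUOTED_IDENTIFIER ON)
SELECT OrderID FROM dbo.Orders;         -- no delimiters at all
```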

There are also general tech editing issues, like talking about "database scope credentials" instead of "database scoped credentials". I also noticed spaces inadvertently inserted into the middle of object names (like a space after a DB name, before the period that starts the schema name).

Improvements I'd like to see

I kept seeing things where, while I'm glad the topics were there, I wished they'd been done better. Here are some examples:

  • Objects were created with single-part names and then later referred to with two-part names. That can lead to unreliable code, as the outcome depends on who runs the code.
  • I wish they hadn't used three-part table names so much.
  • I really wish they'd used the modern system catalog views rather than the old information_schema views. The catalog views came out in SQL Server 2005.
  • I didn't like seeing them use varchar constants in WHERE clauses against nvarchar columns. There's a similar issue with inconsistently concatenating varchar and nvarchar literals and values.
  • I'm not as in love with the FORMAT function as the authors are, given the performance impact I see from it all the time. It's a .NET function, and I'd rather replace it with an intrinsic function wherever I can.
  • There was a discussion comparing ISNULL and COALESCE, yet it didn't mention one of the core differences: the data type returned (see the sketch after this list).
  • CTEs were mentioned, but I was surprised to see no discussion of potential misuse and/or performance problems, particularly given the name of the book. I was also surprised that more important use cases, like multiple references to the same CTE, weren't covered.
  • I'm not sure when most of the writing was done, but the section on window functions completely ignores the super-useful named windows that were added in SQL Server 2022.
  • There were a few places in the query plans and optimization chapter where I thought: "That's not quite how that works", particularly with things like how plans are chosen, the difference between estimated and actual plans, etc.
  • The real limitations around filtered indexes aren't discussed.
  • I would like to have seen external tables discussed for purposes other than just reading files.
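On the ISNULL and COALESCE point above, here's a minimal sketch of the data type difference, using a hypothetical variable:

```sql
-- ISNULL takes its return type from the first argument;
-- COALESCE determines its return type using data type precedence
-- across all of its arguments.
DECLARE @Shorter varchar(4) = NULL;

SELECT ISNULL(@Shorter, 'truncated!');   -- varchar(4): returns 'trun'
SELECT COALESCE(@Shorter, 'truncated!'); -- varchar(10): returns 'truncated!'
```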

Summary

This book left me with a mix of feelings but I applaud the authors for tackling these topics in the first place.

5/10

 

Book Review: Data Cleaning with Power BI

I was excited to see Gus Frazer's new book on Data Cleaning with Power BI. Our friends at Packt sent a copy for me to take a look at.

Gus has a background with both Power BI and Tableau, and it's always interesting to see a mix of perspectives in any book. In this book, he shows a variety of data cleaning/cleansing tasks, then how to do many of them with M in Power Query. And you can tell Gus has a teaching background, because chapters have review questions. The book could well support a course.

Content I found interesting

In later chapters, he gets into fuzzy matching, fill down, and using R and Python scripts, along with how to use ML to clean data. Gus has added quite good coverage of how to create custom functions in M and, importantly, how to optimize queries in M.

The only comment I think was missing is that I try to avoid doing this work in M if I can. I'm a fan of pushing this work upstream wherever possible. Given the topic of the book, I know that's a tall order, but it needs to be mentioned. Power Query and M aren't the strongest contenders when you're dealing with large amounts of data that need to be manipulated. He did have a section on dealing with big data, and that's where I'd most collide with using Power Query and M. The bigger the data, the less likely I'd want to use these tools to do the work.

The book provides some info on data modeling, and why things like bidirectional cross-filtering are such an issue. It was good to see that, given how much of my life I spend removing it from models built by clients.

I liked the coverage of calculation groups.

There was a section on preparing data for paginated reports, and then one on cleaning data using Power Automate. This is another area where I'd differ. The problem with Power Automate is the licensing model. I see far too many solutions built using it that break when the author is no longer with the company. For anything that the organization depends upon, I'd be far more likely to use Logic Apps running under a service account than Power Automate.

I really liked seeing a section on how OpenAI can make this work easier.

Things I'd like to see

This is a good book but there are some things I'd like to see.

Screenshot images

The book is pretty good quality. What I did find hard was that many screenshots were simply too small. I know it's hard to squash the Power BI UI into a portrait-oriented book, but it needs further thought. Here's an example of what I'm talking about:

Tiny writing in the advanced editor

That's probably OK in a PDF version of an eBook where you can zoom in, but I struggle to imagine how that would appear in a printed book. I know it's not easy, and it's an issue I've run into myself when writing.

Calculated columns

While it's important that calculated columns are shown, I really try to avoid them in these models; wherever possible, I like to see that work "pushed left", i.e., back to the source.

Data types

Some of the biggest issues with data transformation and cleansing relate to data types. Power Query and M are fairly poor at working with various data types. In fact, it's a criticism I have of Power BI as a whole. For example, whoever decided that "decimal" should really mean "float", so we end up with columns of numbers that don't always add up, simply made a poor decision in the design of the product. These aspects of data types really need much stronger coverage in the book.

Summary

A good book, and well-written.

7/10

 

SQL Down Under show 91 with guest Mohamed Kabiruddin discussing SQL Server on Google Cloud is now published!

I hadn't seen Mohamed Kabiruddin for a while. He used to be based in Australia and worked for Microsoft for some time. Now he's a product manager at Google. He's always fascinating to chat with, so I was very pleased to have him on a SQL Down Under podcast today.

Mohamed leads Cloud SQL for SQL Server, working with the engineering team to deliver features and capabilities that let enterprises run their SQL Server workloads on Google Cloud SQL. His expertise lies in architecting globally distributed data and information retrieval systems, databases, cloud database migrations, and vector embeddings.

Mohamed spent several years working on applications with SQL Server and is a frequent speaker at industry conferences, meetups, and user groups. He is an active community evangelist for SQL Server and a mentor for aspiring product managers.

I hope you enjoy it. You'll find this show (and previous shows) here: https://podcast.sqldownunder.com/

Book Review: Extending Power BI with Python and R

I've seen a few books lately from the Packt people. The latest was the second edition of Extending Power BI with Python and R by Luca Zavarella: Perform advanced analysis using the power of analytical languages.

Author

Luca Zavarella

The author is Luca Zavarella. I've been working with Power BI since before it was released, and ever since discussions began around using R (initially) and Python (later) with it, Luca has been one of those people that everyone listens to.

Luca is a fellow Microsoft MVP and has a great background in data science. In recent times, he seems to have been focussing on advanced analytics, data science, and AI. He's been prolific in these areas.

The Book

This is another very comprehensive book, and it took quite a while to read. It will be a vital reference for anyone trying to apply Python and/or R in Power BI. I've heard many argue that this is a pretty niche topic, but I can't say I agree. I deal with a real mixture of clients, and while not everyone does this, quite a few do. That particularly applies to anyone from a data science background.

In the book, Luca's experience shines through.

The real power of the book is the way Luca shows you how to do things with Python and/or R that most people working in Power BI without them would think were impossible.

In the DBA world, people often talk about the "accidental DBA", i.e., someone who wasn't planning to be a DBA but ended up being the one doing the tasks. I can well imagine that if you are the "accidental data scientist" working with analytics, there's a great amount to learn from this book.

I could also imagine this book being useful to people who use other analytic products, not just Power BI, because Luca explains the theory, not just the practical application.

And even though I've worked with Power BI from the start, I found a few interesting things in the book about Power BI itself, not just about using Python and R with it. It's important to keep learning, and every time I read something about a product I think I already know well, I invariably learn something anyway. That's often because everyone uses a product differently, or applies it to problems you hadn't considered.

Summary

Another great book. And another one where I can't imagine how long it must have taken to write. And the second edition adds great value.

It's not one for all my usual data audience, but anyone with an interest in applying advanced analytics in Power BI should check it out. You will learn a lot.

9 out of 10

 

PG Down Under show 2 with guest John Miner is now published!

Welcome to show 2 for PG Down Under!

I really enjoyed recording today's show with John Miner. John is a data architect at Insight in the USA. He was a data platform MVP for many years, is a strong community contributor, and blogs at craftydba.com.

After I created the Wide World Importers sample databases for Microsoft back in 2016, I also did a migration of the code to PostgreSQL. That didn't get released for quite a while, but I've now seen it out there in the wild. I wanted to do a show on the migration but was concerned that I had done it so long ago that I might have forgotten what surprised me at the time.

John had a project in the last year that required migrating SQL Server databases to PostgreSQL. I thought that made him the perfect guest for this topic. He's knowledgeable about SQL Server and able to share his experiences and what he learned while migrating code.

He's provided a link to his presentation details (including a PowerPoint presentation) here: https://github.com/JohnMiner3/community-work/tree/master/moving-to-postgresql

You'll find this show along with all PG Down Under shows here:

https://pgdownunder.com/podcast

 

SQL Server Management Studio issues with Central Management Servers in v20.1

I've recently been doing work with a site that makes extensive use of Central Management Servers. And that's an issue if you upgrade past v19.3 of SSMS.

Central Management Servers

If you haven't used these, they're set up through the Registered Servers window in SSMS (View -> Registered Servers). They let you set up groups of servers and execute multi-server queries against all the servers in a group. You can also have a hierarchy of groups with nested folders and then work at different group levels.

The details of all these groups and servers are stored in tables in the msdb database. In fact, it's easy to programmatically add and remove servers from these lists. We do that all the time.
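For example, a query like this lists the groups and the servers registered in each. (This is a sketch against the CMS views in msdb; check the exact view and column names on your version.)

```sql
-- List CMS server groups and the servers registered in each one.
SELECT sg.name AS GroupName,
       rs.name AS RegisteredName,
       rs.server_name AS ServerName
FROM msdb.dbo.sysmanagement_shared_server_groups AS sg
INNER JOIN msdb.dbo.sysmanagement_shared_registered_servers AS rs
ON rs.server_group_id = sg.server_group_id;
```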

They really are quite magical. It's awesome to be able to easily run a query against a large number of servers, and to combine the results into a single result grid in SSMS, along with the name of the server that each row of data came from.
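For example, run even a trivial query like this against a server group, and SSMS returns one combined grid with a server name column added:

```sql
-- Executed against a CMS group rather than a single server, this
-- returns one row per server, and SSMS prefixes each row with the
-- name of the server it came from.
SELECT @@VERSION AS VersionInfo;
```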

What they have never really dealt with, though, is any form of authentication except Windows authentication (i.e., no SQL logins).

So what's changed?

In v20 and later of SSMS, you'll notice a new entry on the first page of the connection dialog.

Trust server certificate dialog

The options for Connection Security have moved to the first page, but more importantly, the defaults have changed. The default option now (Mandatory) requires you to either have a certificate on the server that you already trust, or to check the box to trust the self-signed certificate on the server.

By a certificate that you trust, I mean a certificate installed manually on the SQL Server, issued by a publisher that your client computer has in its list of trusted publishers. That means installing a certificate on each SQL Server and, if the certificate comes from a publisher the clients don't already trust, pushing that publisher out to every client as trusted.
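For comparison, this is roughly how the same choice surfaces in client code; a sketch using current Microsoft.Data.SqlClient connection string keywords, not something taken from SSMS itself:

```
Server=MyServer;Database=master;Integrated Security=True;Encrypt=True;TrustServerCertificate=False;
```

Setting TrustServerCertificate=True in that string is the equivalent of checking the box in the SSMS dialog.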

So what's the challenge for Central Management Servers (CMS)?

I totally understand the desire to tighten up the security, but it's an issue for people who use Central Management Servers and don't already have trusted certificates in place.

As an example, the people I'm helping at the moment have hundreds of servers, all related to different business groups, and changing any one of them requires change processes with change requests, approvals, external system management teams to implement it, etc. Bottom line is that even though it would be desirable, it's not going to happen any time soon.

There is no CMS option to automatically trust the server certificates. You need to do that server by server, and the setting is stored on a per-user basis, not as a value in msdb. That means that even if I went through and edited every server's configuration, every other person using the CMS would need to do the same.

So how do I work around it?

Bottom line is that there's no easy fix for this. Clearly, installing trusted certificates on every server, and pushing out the certificate authority as a trusted publisher to every client is the desirable outcome.

If you can't do that, you need to keep using v19.3 of SSMS, at least for now.

 


SQL Down Under show 90 with guest Joe Sack discussing Azure SQL Database and Copilot is now published!

Joe Sack is an old friend. (I've known him a long time. He's not actually old.) He's always fascinating to chat with, so I was very pleased to have him on another SQL Down Under podcast today. Last time was back in 2017.

Joe is a Senior Product Manager at Microsoft and is working on the integration of data and AI in SQL Copilots. The aim is to use AI and natural language processing to make database management simpler, whether you've been doing it for years or are just getting started. The experiences he discusses in the show include Azure Copilot integration and natural language to SQL.

I hope you enjoy it. You'll find this show (and previous shows) here: https://podcast.sqldownunder.com/

Book Review: Web API Development with ASP.NET Core 8

Another book that I was recently sent for review by the Packt people was Web API Development with ASP.NET Core 8: Learn techniques, patterns, and tools for building high-performance, robust, and scalable web APIs.

Author

The author is Xiaodi Yan, a fellow long-term Microsoft MVP and an experienced software developer, focussing on .NET, AI, DevOps, and all things cloud-related. You'll also find him on LinkedIn Learning as an instructor.

During the Covid lockdowns, I started to get involved in attending the New Zealand Chinese IT user group meetings that were being held online. I thought it would be a great opportunity to keep trying to improve my Mandarin, particularly with technical content. Xiaodi Yan was one of the people who made me feel most welcome.

The Book

This is a very comprehensive book. I remember the author posting about his relief at having finished it, and I now know why. It's a huge book, and it took me a while to go through it. Even so, I think it will be a great ongoing reference when I work on projects that implement APIs this way; the structure of the book lends itself to that.

I liked seeing so many code examples. They really helped with understanding the points he was making.

One area that I've always felt a bit weak on is dependency injection. I'm a developer by background but spend most of my days working in data, and dependency injection was one of those patterns that came along when I was less focussed on it. He explained it so well.

Lately, with any projects I've been involved with, .NET Core is usually the target. I particularly loved the fact that that's what he targeted in this book.

Like many books, there are a few further edits that might have helped. But overall, it's a masterful work.

Summary

What a great book. I can't imagine how long it must have taken to write. And how long it must have taken to get all the required thoughts in the appropriate structure and order to make it such a good reference.

It's not one for my usual data audience, but anyone involved with Web API development should check it out.

9 out of 10