Opinion: Do your bulk data manipulation in T-SQL, not in row operations in SSIS

I really love SQL Server Integration Services (SSIS). In fact, I wish Microsoft was giving it far more love than it is. So many of my clients use it, but unfortunately, not everyone uses it well, or for the right things.

One really common mistake is to perform lots of row-by-row logic within the SSIS packages. Let's take an example of loading a fact table in a data warehouse as an example:

To load a fact table in a data warehouse that's based on a star schema, you need to find the dimension (usually surrogate) key for each dimension that's associated with the fact.

The Wrong Way

Awesome image by David von Diemar

Now clearly you could use a Lookup task in SSIS to go and find the first dimension key, based upon the dimension business ID. If it's present, you add it to your data flow. If it's not present, perhaps you infer a new row, and add the new row's dimension key to the data flow.

Then you go and do it all again for the next dimension, and so on, and so on.

There are situations where this might make sense, but most of the time, it doesn't.

A Better Way

As an alternative, you could use SSIS to just orchestrate some T-SQL to do the same job:

  • Stage the fact data in a table that has the required columns, plus a nullable column for every dimension key
  • Perform a single update on that staging table, with left joins to look up all the dimension keys and record them when present
  • Perform a single query for each dimension where you need to infer new keys and update the staging table
  • Push the staged data into the target fact table.

The difference in performance is often very, very large.

I know that doing the T-SQL work requires a different skill set to what's required for the SSIS work, but whenever you're dealing with a large amount of data manipulation in SSIS, or now in ADF (Azure Data Factory), consider whether you should just be orchestrating T-SQL instead.

 

4 thoughts on “Opinion: Do your bulk data manipulation in T-SQL, not in row operations in SSIS”

  1. Why use SSIS or ADF at all when your staging data is already in the same database as your dom/fact (beit on different schema of course – also cheaper in the cloud). Put the logic in stored procedures and use set based logic only. Easy to code, easy to maintain and uses the same skillset as there folk designing the DW.

    1. Easy to do if it's all in the same DB and/or server. Often though, it's not, and you can't always use linked servers to get it. It also might not even be coming from a DB. SSIS is still great for "getting stuff from somewhere and putting it somewhere else".

  2. 100% agree: I tend to avoid SSIS unless there's no alternative. It's great when properly used for the right task, but when it breaks, it's insane madness!!

    TSQL and stored procs much easier to maintain.

Leave a Reply

Your email address will not be published.