I really love SQL Server Integration Services (SSIS). In fact, I wish Microsoft was giving it far more love than it is. So many of my clients use it, but unfortunately, not everyone uses it well, or for the right things.
One really common mistake is to perform lots of row-by-row logic within the SSIS packages. Let's take an example of loading a fact table in a data warehouse as an example:
To load a fact table in a data warehouse that's based on a star schema, you need to find the dimension (usually surrogate) key for each dimension that's associated with the fact.
The Wrong Way
Now clearly you could use a Lookup task in SSIS to go and find the first dimension key, based upon the dimension business ID. If it's present, you add it to your data flow. If it's not present, perhaps you infer a new row, and add the new row's dimension key to the data flow.
Then you go and do it all again for the next dimension, and so on, and so on.
There are situations where this might make sense, but most of the time, it doesn't.
A Better Way
As an alternative, you could use SSIS to just orchestrate some T-SQL to do the same job:
- Stage the fact data in a table that has the required columns, plus a nullable column for every dimension key
- Perform a single update on that staging table, with left joins to look up all the dimension keys and record them when present
- Perform a single query for each dimension where you need to infer new keys and update the staging table
- Push the staged data into the target fact table.
The difference in performance is often very, very large.
I know that doing the T-SQL work requires a different skill set to what's required for the SSIS work, but whenever you're dealing with a large amount of data manipulation in SSIS, or now in ADF (Azure Data Factory), consider whether you should just be orchestrating T-SQL instead.