SQL: Is there a need to check RI in data warehouses?

Betteridge’s law of headlines says that any headline that asks a question can be answered “no”. Well, contrary to that law, the TLDR answer to this is “yes”. I’m endlessly reading information that says that if your referential integrity is being checked in your OLTP application that’s providing data for your data warehouse, that you don’t need to then check it within the data warehouse. Sadly, this is often naïve thinking, for a number of reasons.

It’s a matter of trust.

The first and most simple example is that even if your OLTP application is perfect in terms of referential integrity, you could have a bug in your ETL process that’s taking that OLTP data and putting in into your data warehouse. You need to find that out as soon as it starts happening.

A second reason is that data warehouses often get data from multiple source systems. You might be depending upon data in one system making references to data in the other system. Again, if this goes wrong, you need to know about it immediately.

A third reason is that you can’t just trust the OLTP system. You could easily have bugs in it. When that happens, you don’t want the incorrect data in that system, polluting the downstream systems. And that situation only gets worse if you have multiple source systems that are feeding data to the data warehouse.

A fourth reason is that because they often store versioning for data and that original data is no longer even available in the source systems, you can’t always just reload a data warehouse from scratch if you let things get messed up.

There are many more reasons. In the end though, when I ask why people don’t have RI in place, the #1 excuse is “performance”. Yet when I ask them if they’ve actually tested it, I’m generally told that they haven’t but that someone’s brother’s friend read it somewhere on the Internet once.

Don’t be that person.

2018-11-19