BI: Can you explain where your analytic data came from?

I’ve seen many challenges with analytics over the years. One that’s painful is an inability to explain where analytic data came from. Someone looks at a report, sees a value, and says I don’t believe that number. Don’t put yourself in that position !

Lineage

I load analytics from data warehouses. Most of my data warehouses are SQL Server databases of some type. Currently, they’re almost always Azure SQL Databases. I like to include information in the database, about how the data got there i.e. the lineage of the data.

How can I record the lineage?

Most analytic data that I work with gets loaded/processed in batches. Sometimes it’s overnight. Other times it’s every few minutes. But either way, there’s a process that’s run, and it puts the data in place.

Each time the process runs, I put details in a table about the process run. And I put the key for that table into every row updated in that run.

What goes in the lineage table?

You might choose different values, but at a bare minimum I’d suggest:

When the process ran - I need to know when the data came in.

Which process - I need to know which process loaded and updated the data.

Source system - Which system/database/file/other source did the data come from?

Process identity - I need to know the identity that the process used when querying the source data. (Different identities might return different data from the source)

At least when someone asks about a row of data, you could at least say that it came from the XYZ source system at 12:24PM on May 12th 2021, and it was loaded and transformed by the PPPP package running as UUUU. You have a chance of establishing the validity of the data.

2021-01-13