I've seen many challenges with analytics over the years. One that's painful is an inability to explain where analytic data came from. Someone looks at a report, sees a value, and says I don't believe that number. Don't put yourself in that position !
Lineage
I load analytics from data warehouses. Most of my data warehouses are SQL Server databases of some type. Currently, they're almost always Azure SQL Databases. I like to include information in the database, about how the data got there i.e. the lineage of the data.
How can I record the lineage?
Most analytic data that I work with gets loaded/processed in batches. Sometimes it's overnight. Other times it's every few minutes. But either way, there's a process that's run, and it puts the data in place.
Each time the process runs, I put details in a table about the process run. And I put the key for that table into every row updated in that run.
What goes in the lineage table?
You might choose different values, but at a bare minimum I'd suggest:
When the process ran – I need to know when the data came in.
Which process – I need to know which process loaded and updated the data.
Source system – Which system/database/file/other source did the data come from?
Process identity – I need to know the identity that the process used when querying the source data. (Different identities might return different data from the source)
At least when someone asks about a row of data, you could at least say that it came from the XYZ source system at 12:24PM on May 12th 2021, and it was loaded and transformed by the PPPP package running as UUUU. You have a chance of establishing the validity of the data.