Book Review: Deciphering Data Architectures

I had some clear time this morning so I read a recent book called Deciphering Data Architectures (Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh) by James Serra.

Price

One comment I need to make is that, for some reason, the O’Reilly titles seem to have become more expensive lately, and their freight options are expensive too. This book was no exception: it was $112 AUD landed at my place. That’s far more than any similar book that I’ve read lately, and this is not a large book.

Content

Overall, I liked this book but it was quite different to what I expected.

I don’t think it’s a book for data engineers or similar hands-on roles. It’s more of a primer on the BI and analytics technologies of the last 20 years, and it’s pitched at the architectural decision level. So while you’ll get good coverage of many design options, you won’t find depth on any of them. It provides the level of knowledge that a typical generic solution architect might need.

Given the timing of its release, I think I was expecting more on design issues around data lakes. Getting them right is hard work.

I have been a long-term observer, architect, and implementor of many of these systems and I’ve seen what works over the years and what doesn’t. I have to say that I see things quite differently to Microsoft’s current guidance and thinking and I was hoping to be more convinced about some of these concepts.

Too much of the current thinking seems to be driven by either quite inexperienced product managers, or the same people who, not long ago, were trying to convince us that Hadoop/HDInsight and related technologies were going to replace everything. I was entirely unconvinced by that hype too.

Technology/Terminology Used

The book aims to be technology-provider agnostic and it largely is, but the examples of specific technologies do tend to reflect James’ background with Microsoft.

When working with tabular data models, I’m not a fan of using the term “cube”.

Another thing I’ve never really liked is the term “modern data warehouse”. Unfortunately, Microsoft tends to push that one. I think that any time you call something “modern” or “new”, you have made a mistake. After all, a shipping box for Windows 3.1 still proudly says “New” even though it’s now 32 years old.

There’s also an implication that “modern” or “new” means “better” or “improved”. Often, that’s far from true. And it causes customers, or worse, their consultants, to implement things just because they are new. I’ve written previously about “modern” not being a synonym for “better”.

Technical

I agreed with most of the technical content in the book and found it accurate, for the most part.

One example of something that wasn’t correct is the description of a foreign key as a column. A foreign key (or a primary key) is a set of one or more columns. In tabular data models within the Microsoft stack, there is a restriction that relationships can only be formed on single columns, so we often place the same restriction on underlying data warehouses built using relational technology. But that’s not an accurate description of how keys work in relational databases. It’s a distinction that many people miss, but a key is not a column.
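To make the key-versus-column distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (order_line, shipment_line, and so on) are hypothetical and are not taken from the book. It declares a two-column primary key and a two-column foreign key, then shows that the database checks the combination of values, not each column on its own.

```python
import sqlite3


def composite_key_rejects_mismatched_pair() -> bool:
    """Return True if a child row with an unmatched (order_id, line_number) pair is rejected."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")

    # Hypothetical parent table: the primary key is a SET of two columns.
    conn.execute(
        """
        CREATE TABLE order_line (
            order_id    INTEGER NOT NULL,
            line_number INTEGER NOT NULL,
            product     TEXT    NOT NULL,
            PRIMARY KEY (order_id, line_number)
        )
        """
    )
    # Hypothetical child table: the foreign key is also a set of two columns.
    conn.execute(
        """
        CREATE TABLE shipment_line (
            shipment_id INTEGER PRIMARY KEY,
            order_id    INTEGER NOT NULL,
            line_number INTEGER NOT NULL,
            FOREIGN KEY (order_id, line_number)
                REFERENCES order_line (order_id, line_number)
        )
        """
    )

    conn.execute("INSERT INTO order_line VALUES (1, 1, 'Widget')")
    conn.execute("INSERT INTO order_line VALUES (2, 2, 'Gadget')")

    # The pair (1, 1) exists in the parent, so this row is accepted.
    conn.execute("INSERT INTO shipment_line VALUES (100, 1, 1)")

    # order_id 1 exists and line_number 2 exists, but never together,
    # so the pair (1, 2) violates the composite foreign key.
    try:
        conn.execute("INSERT INTO shipment_line VALUES (101, 1, 2)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    conn.close()
    return rejected


if __name__ == "__main__":
    print(composite_key_rejects_mismatched_pair())
```

The second insert is the interesting one: each value on its own appears somewhere in the parent table, so a per-column check would let it through. It is the key as a whole that is being enforced.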

I’m also not in love with data lakes to the degree that many other people are. As an example, it’s all very well to tell customers to put their data in Delta/Parquet files, but today, doing that is still fraught with issues around data types, naming conventions, ACID compliance that applies only at the single-table level, and so on, let alone the potential performance issues.

Much of this thinking comes from people who are in love with Databricks and who haven’t been using richer tools.

And while I can see the vision of where tools like Microsoft Fabric are heading, they still strike me as quite incomplete and green. The answer to far too many questions at present is “we’re hoping to get that done in the next year or so”.

Time will tell.

Summary

This book provides a good summary of most of the core technologies involved in providing analytic architectures for data. I think the best audience for it is solution architects who need to get across these technologies.

7 out of 10

2024-08-15