Book Review: SQL for Data Analytics

I recently received a review copy of SQL for Data Analytics: Analyze data effectively, uncover insights and master advanced SQL for real-world applications by Jun Shan, Haibin Li, Matt Goldwasser, Upom Malik, Benjamin Johnston from my friends at PackT. It is the fourth edition of this book.

Authors

Jun Shan is a principal cloud solution advisor.
Haibin Li is a lead predictive modeler. He was a technical reviewer for the third edition of this book.
Matt Goldwasser is Vice President and head of AI and Data Science for Global Distribution at T.Rowe Price.
Upom Malik is a data science and analytics leader.
Benjamin Johnston is a senior data scientist for a leading data-driven MedTech company.

Content

The structure of the book is good. Whenever I see a book with many authors, I get nervous, as invariably the style changes throughout the book. I’ve seen so many books where this was a problem. My best example was one book that had 17 authors. (I wish I was joking). I didn’t find any real issue with that in this book.

It is worth noting, with any book on SQL, is that it really targets a particular variant of the SQL language. In this case, PostgreSQL is the target.

The main part of the book covers a predictable structure:

Some core data modeling concepts, plus information on setting up PostgreSQL.
Creating tables, with a discussion on both CRUD (Create, Read, Update, Delete), and on basic data types and constraints.
Importing data by using the COPY statement.
Details of working with SELECT and filtering.
Transforming and updating data with CASE statements, a number of functions for the main datatypes, and some more information on working with NULL values.
Aggregating data with GROUP BY and HAVING.
Some introductory content on window functions.
Some solid details on processing JSON and arrays.
Details on a number of advanced data types. It was good to see more information on handling dates and intervals, as so many people do this poorly. And the extra information on string and geospatial handling was also welcome.

Reading order

A few things seemed out of place in the reading order, at least for me.

Chapter 4 on manipulating data with Python. It felt like this chapter didn’t fit in this spot in the book. Nothing wrong with the chapter, just curious positioning. Everything before it, and immediately following it was SQL. I think it could have just preceded chapter 13 where some introductory statistics were discussed.

The end of chapter 6 covered creating user-defined functions. I’m not sure how I feel about tackling that before finishing discussions on reading data, and before even getting to joins. Most readers of this book will be concentrating on reading data and many would have no permission to create user-defined functions, at least not in their normal day-to-day work. Obviously they could do that on their own instance of PostgreSQL while reading the book.

Chapter 10 covered basic concepts on SQL performance, and indexing. Again, my concern for this is the target audience for the book. There really isn’t enough information here to let them do this work properly, and that’s fine, again given the target audience.

Case study

The book closes out with an analytical case study. This is a good way to end this book.

Code files and conventions

I’m always pleased to see books nowadays that have GitHub projects with the supporting code, and this book was no exception. Great work !

It’s really important that people have access to the sample code, and the sample database. All was provided.

Throughout the book, I found the examples and code to be well-presented and easy to read. There are a few editing inconsistencies but they aren’t common. Example: capitalization of the functions in Table 8.2.

Possible improvements and errata

With any book on data modeling and the SQL language, there will be aspects of the style used in the book that I’m likely to prefer to do differently. Here are a few examples:

I wouldn’t ever use a US social security number as an example of a primary key. Keys end up all over databases and privacy rules would not permit this. It’s more likely to be a masked/hidden/protected column in a database. This appeared in several places. (I was, however, pleased to see a strong distinction made between a column and a key)
I understand the reasons for using pgAdmin as a client application, as it is standard. But I can’t say I love inflicting it on users when there are other, better, free options available.
I’d like to have seen a simpler GUI based option for installing the sample database, given the target audience. You can easily do that in pgAdmin.
I would like to have seen a database design where all table names are plural or singular. All were plural here except for one.
I might have missed it, but I didn’t see an option to download a database diagram for the sample database. If there isn’t one, it would be a good addition.
ANSI SQL suggests TRUE and FALSE for boolean values, not True and False.
It would be good to show common aliases for data types in the table where they introduced. For example I don’t know anyone who writes timestamp without time zone, even though it is the full name. They just write timestamp. And so did the authors later in the examples.
I wouldn’t show examples where SELECT * was being used as a rowset source for an INSERT statement. We don’t want readers building fragile code. And in so many cases, it’s not a good idea anyway, yet a large number of examples in the book include SELECT *.
For these new learners, I’d rather see ANSI SQL standard FETCH being used instead of LIMIT in SELECT clauses. It is supported in PostgreSQL.
ORDER BY 1 etc. is not recommended for solid code or for maintainability. Some SQL variants are starting to deprecate the use of constants in ORDER BY clauses. In the example shown, the column has an alias. Better to just use the alias. Related comment would apply to GROUP BY 1 although that needs a different solution.
When discussing operators in WHERE clauses, I would like to have seen IS [NOT] DISTINCT FROM covered early on. NULL has already been discussed at that point.
Would be good to have covered the (often) unexpected outcomes from using NOT IN with nullable entries in the list.
Please don’t write dates in US date format and don’t encourage users to set their date format to a local format. In the database, use ISO-8601 format instead.
It would be good to spell out that not all users who can execute DELETE can execute TRUNCATE TABLE.
I would prefer the readers to get used to using 2-part object names, with schemas shown. Again, this leads to more robust code that’s less dependent upon configuration.
When columns have aliases in examples, the optional keyword AS is used. But when tables have aliases, it’s not used. Not sure why. I like to see it all the time for clarity.

Other things I liked

There are many other things in the book that I really liked:

I liked the focus on how to import and export data. That will be very important for the readers of this book, and is an area that many struggle with.
I liked seeing grouping sets included. They are so often omitted in books like this.
I liked seeing detailed coverage of window functions, along with OVER and PARTITION BY. But I liked seeing the WINDOW clause and framing discussions even more.
I liked the statistical information in chapter 13.

Summary

Overall, this is a very good book. Writing a book like this is a substantial undertaking. And writing it with multiple authors, and with clarity and consistency is even harder. Even though I’ve pointed out a number of things I’d do differently, the authors have done a good solid job here.

8 out of 10

2025-11-23