When I'm working at client sites, I get really concerned when I see personal data not being handled or protected appropriately. One of the biggest sins I see is developers pretending to encrypt data when they really aren't encrypting it at all.
I'm sure that looks good to a manager, but please, just don't do this!
When I look at the table definition shown in the main image above, my eye is drawn to the column called EncryptedTFN. In Australia, we refer to our tax numbers as TFNs. They are 11 digits long for most people and should never be stored in plain text in a database. The column should be encrypted.
At first glance, you might think it is encrypted, based on the column name. But I know immediately, without looking any further, that this column isn't encrypted.
If the column were encrypted, it would be much longer than 12 characters. Real encryption can't afford to encode an 11 digit number into a 12 character field; it would be far too easy to reverse engineer.
Encryption algorithms return much longer values. Often they'll also use "salt" to ensure that encrypted values aren't always the same. That makes them even longer.
If the column were encrypted, it would also likely be a varbinary data type, not a character string. It is possible to encode a binary value as a character string, but if you did that, it would be longer again.
So what is this?
What's happened here is that the developers have come up with their own "encryption algorithm".
Never ever ever do this!
All they've done is obfuscate the value, i.e. make it harder to read. For anyone comfortable with how encoding works, this would be trivial to "decrypt".
If you need to protect data, you need to use encryption. Algorithms for doing this are tricky to write but the hard work has already been done for you. SQL Server includes great options for using real encryption and hashing algorithms.
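As a sketch of what real protection looks like (the table, column, and passphrase here are made up for illustration, and a passphrase is the simplest option, not a recommendation for production key management), note both the varbinary data type and the generous length:

```sql
-- Illustrative only: names and passphrase are invented for this sketch.
-- An encrypted value belongs in a varbinary column, sized generously:
CREATE TABLE dbo.Customers
(
    CustomerID int NOT NULL PRIMARY KEY,
    EncryptedTFN varbinary(256) NOT NULL  -- not char(12)!
);

-- ENCRYPTBYPASSPHRASE adds a random salt each time, so even the same
-- input value produces a different (and much longer) ciphertext:
INSERT dbo.Customers (CustomerID, EncryptedTFN)
VALUES (1, ENCRYPTBYPASSPHRASE(N'<strong passphrase here>', N'123456782'));

-- Decryption reverses it (and returns NULL if the passphrase is wrong):
SELECT CustomerID,
       CAST(DECRYPTBYPASSPHRASE(N'<strong passphrase here>', EncryptedTFN)
            AS nvarchar(11)) AS TFN
FROM dbo.Customers;
```

In a real system you'd more likely use certificates and symmetric keys (ENCRYPTBYKEY) or Always Encrypted, but even this minimal sketch shows why a 12 character column can't be holding encrypted data.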
Don't be the one who gets blamed when this goes seriously wrong.
If you're not comfortable with using encryption, there's no time like the present to learn about it. We have a great online encryption course that covers understanding how encryption works, certificates and keys, digital signatures on stored procs, transparent data encryption (TDE), extensible key management (EKM), Always Encrypted (with and without secure enclaves), and more.
Why not be the one on your team who knows how this works? It's never been more important. You'll find it here:
I'm pleased to let you know that version 20 of our free SDU Tools for developers and DBAs is now released. It's all SQL Server and T-SQL goodness.
If you haven't been using SDU Tools yet, I'd suggest downloading them and taking a look. At the very least, it can help when you're trying to work out how to code something in T-SQL. You'll find them here:
Along with the normal updates to SQL Server versions and builds, we've added the following new functions:
WeekdayOfSameWeek – this one was requested by Dave Dustin and returns a nominated day of the week for a given date. For example, it will let you find the Thursday of the week that 23 Oct 2020 falls in.
NearestWeekday – this one was requested by Martin Catherall and returns the nominated day of the week that's closest to a given date (rather than one in the same week, as above). For example, it will let you find the closest Thursday to 29 Oct 2020.
CobolCase – we've had so many case options, I'm surprised we hadn't had this one yet. It returns upper-case words with dashes in between, i.e. COBOL-CASE-IS-COOL
DateOfOrthodoxEaster – we previously had a function to let you find the date of the next Easter, but that was the Christian Easter. Now we've adapted a sample sent to us by Antonios Chatzipavlis to calculate the date of the Orthodox Easter for a given year.
IsLockPagesInMemoryEnabled – this one does just what it says. It determines whether or not lock pages in memory is enabled.
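As a rough idea of how these are called (the parameter shapes shown here are my assumptions for illustration — check the SDU Tools documentation for the exact signatures):

```sql
-- Assumed signatures, for illustration only.
-- Find the Thursday of the week that 23 Oct 2020 falls in:
SELECT SDU_Tools.WeekdayOfSameWeek('20201023', 5);

-- Find the Thursday closest to 29 Oct 2020:
SELECT SDU_Tools.NearestWeekday('20201029', 5);

-- COBOL-style casing:
SELECT SDU_Tools.CobolCase(N'cobol case is cool');  -- COBOL-CASE-IS-COOL
```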
There are a few enhancements as well:
We've done the usual updates for SQL Server builds and versions (up to date as at the point of release).
DateDimensionPeriodColumns now has a wealth of new columns including IsStartOfMonth, IsEndOfMonth, IsStartOfCalendarYear, IsEndOfCalendarYear, IsStartOfFiscalYear, IsEndOfFiscalYear, and IsStartOf and IsEndOf columns for all quarters, both calendar and fiscal.
DateDimensionColumns now has quarters and start and end of month.
We had a new house built a while back, and in a few rooms there was a double switch: one for the light and one for a fan.
But which is which?
Now the old way to do that would have been to put a label on each one. Seems like a reasonable idea but unless that's a braille label, and you can read braille, that's not going to help you in the dark. You want to just reach in and turn on the correct switch. That's easy enough to do, but it really only works if the electrician who installed them followed a pattern i.e. the switch furthest inside might be the light, and the one closest to the door might be the fan.
If you only had one room like this, it mightn't matter much, but if you have several rooms, you'd hope they're done the same way in each.
But they weren't.
And that got me thinking about someone who does it the same way each time, and someone who doesn't. A licensed electrician might install it safely, and both switches might work, but a more professional electrician would check not only that it works, but that it was done in a consistent way, throughout the house, and throughout all the houses they worked on.
When I see someone who installs things differently in different parts of a house, it leaves me with an uneasy feeling about the overall quality of their work. It's a clear indication of their lack of attention to detail. There might even be a standard for how that should be done. If there isn't, there should be. So perhaps there is a standard and they didn't follow it.
For a number of years I worked as an engineer for HP, in the fun days where they had the best commercial minicomputer systems in the industry. That involved working on large pieces of equipment (including some that were quite scary to work on), with an enormous number of pieces and screws to hold them together.
If I ever started to work on a piece of equipment, and found screws were missing, even if they were where the customer wouldn't see them, I got a really bad feeling about whoever worked on the same machine before me. Fortunately, I generally knew that whoever did it was unlikely to be based in our office, as the engineers doing work out of our office were meticulous about that. It was an unwritten rule about doing quality work.
And so the same things apply to development work. Are you hacking together something that works? Or are you aiming for more than that?
When someone needs to do further work on a project that you worked on, how will they feel about it?
I was reading posts in an email distribution list yesterday and someone asked if SQL Server Replication was deprecated. First up, let's just confirm that it's not. As Mark Twain said: "The reports of my death are greatly exaggerated". There's still a really strong need for it, and somewhat ironically, that need has been getting stronger lately.
Back when replication was introduced, it had a bit of a reputation for being hard to set up and hard to look after. It was in the same category as features like clustering. When I was teaching SQL Server 6.5 classes, you could tell which students could follow instructions by whether they managed to get clustering working. Fortunately, it's nothing like that today, but you'll still hear from people with "old pain".
Styles of Replication
SQL Server Replication is a general term, and it covers a number of features. That's part of the reason that you get mixed messages.
Merge Replication

I don't use Merge Replication and I avoid it at all costs.
There, I've said it.
Merge works by adding triggers to tables and capturing the changes, then periodically synching up with other machines and trying to apply each other's changes. And no big surprise, conflicts are an issue. There are ways to resolve them, but they mostly involve overwriting someone else's changes. Even though one database is the publisher, each system involved is pretty independent. It's very similar in many ways to the form of replication that older Access systems used.
Where Merge falls down is in performance. It can be a real challenge, particularly if you have a complex topology. It's easy to end up with what we call "merge storms" where systems are just endlessly moving data around. It works ok for small numbers of systems and small volumes of changes, particularly if they aren't likely to conflict with each other.
Snapshot Replication

If ever there was a word that's overloaded in SQL Server, it's snapshot. (OK, DAC might be close to that claim.) In the case of replication though, the basic idea is that you just periodically copy over a full copy of the data.
This is simple and reliable. It works well when you have a small amount of data and it's not worth tracking the changes, or when sending all the data is less work than sending the changes because the data is endlessly being modified.
It's an excellent option for pushing out reference data from a central database to many other databases.
Transactional Replication

This is the most common form of replication today. The transaction log of the publishing database is read (by a log reader process), and the interesting entries are extracted from the log and sent to a distribution location. From there, changes can be pushed to subscribers, or pulled by subscribers.
There's a less-commonly used variant of it called Peer to Peer Transactional Replication. We'd rarely use that option.
As for "standard" transactional replication, nowadays it's pretty easy to set up, and it's not that hard to look after. The tooling has improved markedly over the years.
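As a minimal sketch of that setup (run at the publisher; the database, publication, and table names are illustrative, and this assumes a distributor is already configured):

```sql
-- Illustrative names throughout; assumes the distributor already exists.
USE SalesDB;
GO

-- Enable the database for publishing:
EXEC sp_replicationdboption
    @dbname = N'SalesDB',
    @optname = N'publish',
    @value = N'true';

-- Create a transactional publication:
EXEC sp_addpublication
    @publication = N'SalesPub',
    @status = N'active';

-- Add a Snapshot Agent job for the initial snapshot:
EXEC sp_addpublication_snapshot @publication = N'SalesPub';

-- Publish a table as an article:
EXEC sp_addarticle
    @publication = N'SalesPub',
    @article = N'Orders',
    @source_object = N'Orders';
```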
But why would you use it instead of other options like readable secondaries on availability groups? Well, for lots of reasons. Here are my main reasons:
On-Premises SQL Server to Azure SQL Database for BI
A few years back, we got the ability to replicate data from on-premises SQL Server databases into Azure SQL Database. (Note: I'm not talking about just Azure SQL Managed Instances or SQL Server in Azure VMs, but real Azure SQL Database). This is a wonderful option and is causing a resurgence in the use of transactional replication.
We have many customer sites with on-premises SQL Server systems, and this lets them push a copy of the data, kept quite up to date, into Azure SQL Database. That's then used either to centralize the data, or to perform analytics on it.
Let me give you an example: I've recently been dealing with a company that has locations across the country, all of which are running their own on-premises SQL Server based systems. We're able to add replicas in Azure SQL Database, and now we can access all the data from the source systems directly in Azure.
Gone are the days where you'd be setting up VPNs from each site to the head office to get at this sort of data. And if any site's link is down (including the head office's), the others keep working. When the site comes back, the data just continues to flow.
And it's a really great source of data for Azure Analysis Services, and/or Power BI.
Subset of Data Required for Analytics
A fellow MCM commented to me recently that if you needed to move 100 tables totalling 2GB for analytics, and the whole database was 4TB and had 30,000 tables, an availability group replica would be entirely the wrong choice.
With transactional replication, I can replicate just the required tables. In fact, I can replicate just the required columns and rows.
When I have data replicated to another server, I can then use a different indexing strategy on that target server. This means that if I need indexes to run analytic queries and those indexes aren't needed on the source system (or might hinder it), I can just add them on the target. You can't do that with an availability group replica. You'd have to put the indexes on the primary database.
This is especially useful for 3rd party application databases where they'll get all upset if you start adding indexes to their database. You still need to check if they're ok with you replicating some of the tables in their database, but we've had good outcomes with this.
Schema and Data Type Differences
While we generally don't do this, there's nothing to stop you having different table definitions on the target systems. INSERT, UPDATE, and DELETE commands flow to the target and you can choose how they are applied. You can even write your own stored procedures to apply them in whatever way you want.
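A sketch of pointing an article at custom apply procedures (procedure names are invented here; the procedures themselves live at the subscriber and can reshape the data however you need):

```sql
-- Illustrative names. CALL and SCALL describe how parameters are
-- passed to the custom procedures at the subscriber.
EXEC sp_addarticle
    @publication = N'SalesPub',
    @article = N'Orders',
    @source_object = N'Orders',
    @ins_cmd = N'CALL dbo.usp_ApplyOrderInsert',
    @upd_cmd = N'SCALL dbo.usp_ApplyOrderUpdate',
    @del_cmd = N'CALL dbo.usp_ApplyOrderDelete';
```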
Large Numbers of Secondaries and No Domain Join Needed
With replication, you can have a large number of secondaries if all you're trying to do is to distribute data so that it's local to other systems.
And while availability groups have basic options for creation outside of domains, transactional replication works just fine in these situations.
You do need to learn how to work with, and manage, replication. It's easier today than it's ever been, but most people I see having issues with it just haven't really understood something about how it works.
Given how useful it now is with the Azure SQL Database options, there's never been a better time to learn about it.
If you want a quick way to learn, we have an online course that will probably take you a day or two to complete. You'll find it here:
I've been in so many companies lately where new CTOs and CIOs claim to have a cloud focus, but all they want to do is migrate all their existing systems from on-premises VMs to cloud-based VMs. They talk about making a cloud transformation but they're not transforming anything.
I was pleased to get a chance to create a short educational series for the people at PASS to cover some of my thoughts on how to make a real transformation, not just a migration.
It's unfortunate that the SELECT query in SQL isn't written in the order that operations logically occur (if not physically). I suspect that's one of the things that makes learning SQL a bit harder than it needs to be.
Without getting into really complex queries, you need to understand the logical order of the operations.
FROM

The starting point is to determine where the data is coming from. This is normally a table, but it could be other sets of rows, like views or table expressions.
WHERE

This clause is the initial filtering of the data retrieved by the FROM clause. It's all about finding the rows that need to be considered in the query.
GROUP BY

This clause is about summarizing or aggregating the rows. Here we determine the columns that will be used for grouping. The SELECT clause will normally include these grouping columns so that you can make sense of the output. It might also include other columns, but only if they are aggregated via MIN, MAX, AVG, etc.
HAVING

Newcomers to SQL often get confused about the purpose of the HAVING clause, particularly when comparing it to the WHERE clause. The HAVING clause is a filter applied to the groups of data that have come from the GROUP BY clause, whereas the WHERE clause filters rows of data.
SELECT

The SELECT clause is about deciding which columns or aggregates need to be returned. Formally, it's often called a projection of the data. Columns or aggregates can also be aliased at this point (i.e. given another name). If you notice how late in the logical order the SELECT clause comes, you might realize why an alias can be used later in an ORDER BY but not in any of the earlier clauses.
ORDER BY

SQL queries don't have any default order. If you want data returned in a specific order, then you need to specify the columns or aggregates that you want to order the output by.
It's really important you start to have this picture in your head of the order that queries are logically sequenced.
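Putting it all together, here's a simple query (the table and column names are made up for illustration) annotated with its logical evaluation order:

```sql
SELECT c.Region,                     -- 5. SELECT: project and alias
       COUNT(*) AS CustomerCount
FROM dbo.Customers AS c              -- 1. FROM: determine the source rows
WHERE c.IsActive <> 0                -- 2. WHERE: filter individual rows
GROUP BY c.Region                    -- 3. GROUP BY: form the groups
HAVING COUNT(*) > 10                 -- 4. HAVING: filter the groups
ORDER BY CustomerCount DESC;         -- 6. ORDER BY: can use the SELECT alias
```

Note that ORDER BY can reference the alias CustomerCount precisely because it runs after the SELECT clause; trying the same alias in the WHERE or HAVING clause would fail.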
I saw the above message when working with one of my clients today. The error says "Only users with Power BI Pro licenses can publish to this workspace". And that would make sense if they hadn't already purchased Power BI Pro licenses for the user.
I checked online, and there were a number of comments about people seeing this error. There were the usual suggestions of logging out and back into Power BI. There was even one who'd quoted a Microsoft support person who said that Pro licenses don't work for up to 24 hours after you purchase them. (That sounds dubious to me).
We checked that the license had been assigned to the user. I was concerned that perhaps the licenses were purchased but not actually assigned to the users. All was good with the license assignment.
But then we noticed something odd:
Note that the user still had their Free Power BI license assigned to them, and for some unknown (and not very sensible) reason, that was the one that the system was using when they logged in.
We removed the free license assignment and all was then good.
Many years ago, I spent a lot of time in universities. I ended up finishing my studies at QUT in Brisbane, and I have a great and continuing fondness for that institution. Earlier on though, amongst other universities, I did quite a lot of study through Charles Sturt University.
Over the years, I've maintained a continuous link with my friends at Charles Sturt University (CSU). Back when we used to run Code Camps for both developers and DBAs, CSU were only too pleased to jump in to help us. From the minute we arrived that first day in Wagga Wagga, I knew it was going to be good. Associate Professor Irfan Altas was an amazing help and remains a friend to this day. I'm always pleased to get to chat to him.
Irfan asked me to be the guest speaker at a CSU graduation a few years back. That was a great honour.
What many people might not realise is that I've also been helping as an industry supervisor for students in PhD or Doctor of IT programs.
And that's the reason for this post. I was so excited yesterday to hear that one of the students that I've been supervising (in this case together with Professor Oliver Burmeister) has completed all the requirements to be admitted to his Doctor of IT degree.
Congratulations to Dr Georg Thomas! That's a major life achievement for you.