I was really excited today to see details of the upcoming #dataweek. Anything with a name like that already has my interest.
In this case though, there's a #CreateData event that's happening as part of #dataweek. And the focus of that looks to be on Azure Synapse Analytics.
Azure Synapse Analytics
I know that many of my readers won't have spent much time with Azure Synapse Analytics. The product has grown out of Azure SQL Data Warehouse (note not Azure SQL Database) but it's now much more than just an upgraded Azure SQL DW.
It's a tool that combines data warehousing with Big Data analytics. So many of the early Big Data offerings struck me as just Big Hype, and I wrote about that back in 2013. Early last year, I wrote about my concern about how "modern" seems to be treated as a synonym for "better", when it certainly isn't. Big Data, and Hadoop in particular, was an example of that for me.
My real concern was that most of these tools were only addressing a small part of what was needed.
Why Azure Synapse Analytics?
One key advantage of Azure Synapse Analytics is that it tackles both the data warehousing and Big Data needs, in a single product. That's far more useful. And it's trying to do it without a lot of moving data around, as that introduces latency.
I spent the last two weeks on a DevOps project focussing on building and deploying Azure Synapse Analytics projects. I was impressed by how mature the tooling now was, compared to when I'd last looked at it.
If you're not across where this product is at, I'd encourage you to consider registering for #dataweek.
I work with a lot of technology on a daily basis, much of it amazing. But I still get excited when relatively small enhancements are made, and they make my life or development much better.
Timezone support in schedule triggers in Azure Data Factory is one of those changes.
In Data Factory, the most common type of trigger that we use is a schedule trigger. It's used to run a pipeline at a specific time or series of times. But one of the most painful aspects of these triggers was that they didn't have time zone support.
That meant that for a city like Melbourne or Sydney, where daylight saving time applies, I couldn't set a pipeline to just run at, say, 2AM every day, Melbourne time. Every six months, I'd have to go in and change the trigger times to achieve that.
I'd also have to endlessly convert between our local times and UTC, as UTC was the time zone the trigger times had to be entered in.
If you only have a few triggers, that might be OK, but when you have a lot of them, well that's painful.
Design was Ready
When I looked at the JSON representation of the trigger, there was a time zone in there:
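From memory, the relevant part of the trigger's JSON looked something like this (a sketch only; the property names follow the ADF schedule trigger schema, and the values here are invented):

```json
{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2020-10-01T02:00:00",
                "timeZone": "UTC"
            }
        }
    }
}
```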
So I knew that whoever designed it meant for it to work, but ADF didn't support any value other than UTC.
Great News!
In the last few days, we got the great news that time zones are now supported in these triggers.
You can see the option now in the UI:
I can't tell you how pleased I am to see that there.
When I'm working in client sites, I get really concerned when I see personal data not being handled or protected appropriately. And one of the biggest sins that I see is where developers have pretended to be encrypting data, but they really aren't encrypting it.
I'm sure that looks good for a manager but please, just don't do this!
When I look at the table definition shown in the main image above, my eye is drawn to the column called EncryptedTFN. In Australia, we refer to our tax file numbers as TFNs. They are 9 digits long for most people and should never be stored in plain text in a database. The column should be encrypted.
At first glance, you might think it is encrypted, based on the column name. But I know immediately, without looking any further, that this column isn't encrypted.
If the column was encrypted, it would be much longer than 12 characters. Real encryption couldn't afford to have a 9 digit number encoded into a 12 character field. It would be far too easy to reverse engineer.
Encryption algorithms return much longer values. Often they'll also use "salt" to ensure that encrypted values aren't always the same. That makes them even longer.
If the column was encrypted, it would likely be of a varbinary data type, not a character string. It is possible to encode a binary value as a character string, but if you did that, it would be longer still.
So what is this?
So what's happened here is that the developers have come up with their own "encryption algorithm".
Never ever ever do this!
All they've done is obfuscate the value i.e. make it harder to read. But for anyone comfortable with how encoding works, this would be trivial to "decrypt".
If you need to protect data, you need to use encryption. Algorithms for doing this are tricky to write but the hard work has already been done for you. SQL Server includes great options for using real encryption and hashing algorithms.
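As a minimal sketch of what real encryption looks like in T-SQL (the passphrase and sample value are invented; in production you'd normally use certificates and symmetric keys rather than a hard-coded passphrase):

```sql
-- ENCRYPTBYPASSPHRASE adds a random initialization vector, so the output is a
-- varbinary value that's much longer than the input, and different on every call.
DECLARE @TFN varchar(20) = '123456782';  -- invented sample value

DECLARE @EncryptedTFN varbinary(8000)
    = ENCRYPTBYPASSPHRASE(N'A strong passphrase', @TFN);

SELECT DATALENGTH(@EncryptedTFN) AS EncryptedLength;  -- dozens of bytes, not 12

-- And decryption round-trips the original value:
SELECT CAST(DECRYPTBYPASSPHRASE(N'A strong passphrase', @EncryptedTFN) AS varchar(20));
```

Note how the length of the output alone makes it obvious that real encryption is in play.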
Don't be the one who gets blamed when this goes seriously wrong.
If you're not comfortable with using encryption, there's no time like the present to learn about it. We have a great online encryption course that covers everything from understanding how encryption works, through certificates, keys, digital signatures on stored procs, transparent data encryption (TDE), extensible key management (EKM), and Always Encrypted (with and without secure enclaves), and more.
Why not be the one in your team that knows how this works? It's never been more important. You'll find it here:
I'm pleased to let you know that version 20 of our free SDU Tools for developers and DBAs is now released. It's all SQL Server and T-SQL goodness.
If you haven't been using SDU Tools yet, I'd suggest downloading them and taking a look. At the very least, it can help when you're trying to work out how to code something in T-SQL. You'll find them here:
Along with the normal updates to SQL Server versions and builds, we've added the following new functions:
WeekdayOfSameWeek – this one was requested by Dave Dustin and returns a nominated day of the week for a given date. For example, it will let you find the Thursday of the week that 23 Oct 2020 falls in.
NearestWeekday – this one was requested by Martin Catherall and returns a nominated day that's closest to a given date. For example, it will let you find the closest Thursday to 29 Oct 2020 (rather than the one in the same week, as above).
CobolCase – we've had so many case options, I'm surprised we'd not had this one yet. It returns upper case words with dashes in-between i.e. COBOL-CASE-IS-COOL
DateOfOrthodoxEaster – we previously had a function to let you find the date of the next Easter, but that was the Western Christian Easter. Now we've adapted a sample sent to us by Antonios Chatzipavlis to calculate the date of the Orthodox Easter for a given year.
IsLockPagesInMemoryEnabled – this one does just what it says. It determines whether or not lock pages in memory is enabled.
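To give a feel for how these are called, here's a sketch (I'm writing the parameter lists from memory, so check the tool documentation for the exact signatures and day-numbering conventions):

```sql
-- Hypothetical calls; parameter order and day numbers are assumptions
SELECT SDU_Tools.WeekdayOfSameWeek('20201023', 5);  -- e.g. the Thursday of that week
SELECT SDU_Tools.NearestWeekday('20201029', 5);     -- e.g. the closest Thursday to that date
SELECT SDU_Tools.CobolCase(N'cobol case is cool');  -- COBOL-CASE-IS-COOL
```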
There are a few enhancements as well:
We've done the usual updates for SQL Server builds and versions (up to date as at the point of release).
DateDimensionPeriodColumns now has a wealth of new columns, including IsStartOfMonth, IsEndOfMonth, IsStartOfCalendarYear, IsEndOfCalendarYear, IsStartOfFiscalYear, IsEndOfFiscalYear, and IsStartOf and IsEndOf variants for all quarters, both calendar and fiscal.
DateDimensionColumns now has quarters and start and end of month.
We had a new house built a while back, and in a few rooms there was a double switch: one for the light and one for a fan.
But which is which?
Now the old way to do that would have been to put a label on each one. Seems like a reasonable idea but unless that's a braille label, and you can read braille, that's not going to help you in the dark. You want to just reach in and turn on the correct switch. That's easy enough to do, but it really only works if the electrician who installed them followed a pattern i.e. the switch furthest inside might be the light, and the one closest to the door might be the fan.
If you only had one room like this, it mightn't matter much, but if you have several rooms, you'd hope they're done the same way in each.
But they weren't.
And that got me thinking about the difference between someone who does it the same way each time, and someone who doesn't. A licensed electrician might install it safely, and both switches might work, but a more professional electrician would check not only that it works, but that it was done in a consistent way, throughout the house, and throughout all the houses that they worked on.
When I see someone who installs things differently in different parts of a house, it leaves me with an uneasy feeling about the overall quality of their work. It's a clear indication of their lack of attention to detail. There might even be a standard for how that should be done. If there isn't, there should be. So perhaps there is a standard and they didn't follow it.
For a number of years I worked as an engineer for HP, in the fun days where they had the best commercial minicomputer systems in the industry. That involved working on large pieces of equipment (including some that were quite scary to work on), with an enormous number of pieces and screws to hold them together.
If I ever started to work on a piece of equipment, and found screws were missing, even if they were where the customer wouldn't see them, I got a really bad feeling about whoever worked on the same machine before me. Fortunately, I generally knew that whoever did it was unlikely to be based in our office, as the engineers doing work out of our office were meticulous about that. It was an unwritten rule about doing quality work.
And so the same things apply to development work. Are you hacking together something that works? Or are you aiming for more than that?
When someone needs to do further work on a project that you worked on, how will they feel about it?
I was reading posts in an email distribution list yesterday and someone asked if SQL Server Replication was deprecated. First up, let's just confirm that it's not. As Mark Twain said: "The reports of my death are greatly exaggerated". There's still a really strong need for it, and somewhat ironically, that need has been getting stronger lately.
Back when replication was introduced, it had a bit of a reputation for being hard to set up, and hard to look after. It was in the same category as features like clustering. If I was teaching a SQL Server 6.5 class, you could tell which students could follow instructions if they managed to get clustering working. Fortunately, it's nothing like that today but you'll still hear from people with "old pain".
Styles of Replication
SQL Server Replication is a general term, and it covers a number of features. That's part of the reason that you get mixed messages.
Merge Replication
I don't use Merge Replication, and I avoid it at all costs.
There, I've said it.
Merge works by adding triggers to tables and capturing the changes, then periodically synching up with other machines and trying to apply each other's changes. And no big surprise, conflicts are an issue. There are ways to resolve them, but they mostly involve overwriting someone else's changes. Even though one database is the publisher, each system involved is pretty independent. It's very similar in many ways to the form of replication that older Access systems used.
Where Merge falls down is in performance. It can be a real challenge, particularly if you have a complex topology. It's easy to end up with what we call "merge storms" where systems are just endlessly moving data around. It works ok for small numbers of systems and small volumes of changes, particularly if they aren't likely to conflict with each other.
Snapshot Replication
If ever there was a word that is overloaded in SQL Server, it's snapshot. (OK, DAC might be close to that claim.) In the case of replication though, the basic idea is that you just periodically copy over a full copy of the data.
This is simple and reliable. It works well when you have a small amount of data and it's not worth tracking the changes, or when the data is modified so heavily that sending all of it is less work than sending the changes.
It's an excellent option for pushing out reference data from a central database to many other databases.
Transactional Replication
This is the most common form of replication today. The transaction log of the publishing database is read (by a log reader process), the interesting stuff is extracted from the log and sent to a location where it's distributed. From there, it can be pushed to subscribers, or pulled by subscribers.
There's a less-commonly used variant of it called Peer to Peer Transactional Replication. We'd rarely use that option.
As for "standard" transactional replication, nowadays it's pretty easy to set up, and it's not that hard to look after. The tooling has improved markedly over the years.
But why would you use it instead of other options like readable secondaries on availability groups? Well, for lots of reasons. Here are my main ones:
On-Premises SQL Server to Azure SQL Database for BI
A few years back, we got the ability to replicate data from on-premises SQL Server databases into Azure SQL Database. (Note: I'm not talking about just Azure SQL Managed Instances or SQL Server in Azure VMs, but real Azure SQL Database). This is a wonderful option and is causing a resurgence in the use of transactional replication.
We have many customer sites where they have on-premises SQL Server systems and this lets them push a copy of the data, quite up to date, into Azure SQL Database. That's used to either centralize the data, or to then perform analytics on it.
Let me give you an example: I've recently been dealing with a company that has locations across the country, all of which are running their own on-premises SQL Server based systems. We're able to add replicas in Azure SQL Database, and now we can access all the data from the source systems directly in Azure.
Gone are the days when you'd be setting up VPNs from each site to the head office to get this sort of data. And if any site's link is down (including the head office's), the others keep working. When the site comes back, the data just continues to flow.
And it's a really great source of data for Azure Analysis Services, and/or Power BI.
Subset of Data Required for Analytics
A fellow MCM commented to me recently that if you needed to move 100 tables totalling 2GB for analytics, and the whole database was 4TB and had 30,000 tables, an availability group replica would be entirely the wrong choice.
With transactional replication, I can replicate just the required tables. In fact, I can replicate just the required columns and rows.
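As a sketch of what that looks like when defining an article (the publication and table names here are invented), sp_addarticle lets you supply a row filter via its @filter_clause parameter:

```sql
-- Invented names; a sketch of row filtering when publishing an article
EXEC sp_addarticle
    @publication = N'AnalyticsPub',
    @article = N'Orders',
    @source_owner = N'dbo',
    @source_object = N'Orders',
    @filter_clause = N'OrderDate >= ''20200101''';  -- replicate only recent rows
```

Column filtering is configured separately (sp_articlecolumn), so the subscriber only ever receives the columns and rows it needs.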
When I have data replicated to another server, I can then use a different indexing strategy on that target server. This means that if I need indexes to run analytic queries and those indexes aren't needed on the source system (or might hinder it), I can just add them on the target. You can't do that with an availability group replica. You'd have to put the indexes on the primary database.
This is especially useful for third-party application databases, whose vendors will get all upset if you start adding indexes to their database. You still need to check if they're ok with you replicating some of the tables in their database, but we've had good outcomes with this.
Schema and Data Type Differences
While we generally don't do this, there's nothing to stop you having different table definitions on the target systems. INSERT, UPDATE, and DELETE commands flow to the target and you can choose how they are applied. You can even write your own stored procedures to apply them in whatever way you want.
Large Numbers of Secondaries and No Domain Join Needed
With replication, you can have a large number of secondaries if all you're trying to do is to distribute data so that it's local to other systems.
And while availability groups have basic options for creation outside of domains, transactional replication works just fine in these situations.
You do need to learn how to work with, and manage, replication. It's easier today than it's ever been, but most people that I see having issues with it just haven't really understood something about how it works.
Given how useful it now is with the Azure SQL Database options, there's never been a better time to learn about it.
If you want a quick way to learn, we have an online course that will probably take you a day or two to complete. You'll find it here:
I've been in so many companies lately where new CTOs and CIOs claim to have a cloud focus, but all they want to do is migrate all their existing systems from on-premises VMs to cloud-based VMs. They talk about making a cloud transformation but they're not transforming anything.
I was pleased to get a chance to create a short educational series for the people at PASS to cover some of my thoughts on how to make a real transformation, not just a migration.
It's unfortunate that the SELECT query in SQL isn't written in the order that operations logically occur (if not physically). I suspect that's one of the things that makes learning SQL a bit harder than it needs to be.
Without getting into really complex queries, you need to understand the logical order of the operations.
FROM
The starting point is to determine where the data is coming from. This is normally a table, but it could be other sets of rows, like views or table expressions.
WHERE
This clause is the initial filtering of the data that was retrieved in the FROM clause. This is all about finding the rows that need to be considered by the query.
GROUP BY
This clause is about summarizing or aggregating the rows. We determine the columns that will be used for grouping. The SELECT clause will normally include these grouping columns so that you can make sense of the output. The SELECT might also have other columns, but only if they are aggregated via MIN, MAX, AVG, etc.
HAVING
Newcomers to SQL often get confused about the purpose of the HAVING clause, particularly when comparing it to the WHERE clause. The HAVING clause is a filter applied to the groups of data that have come from the GROUP BY clause, whereas the WHERE clause filters rows of data.
SELECT
The SELECT clause is about deciding which columns or aggregates need to be returned. Formally, it's often called a projection of the data. Columns or aggregates can also be aliased at this point (i.e. given another name). If you notice how late in the logical order the SELECT clause is, you might realize that this is why an alias can be used later, in an ORDER BY, but not in any of the previous clauses.
ORDER BY
SQL queries don't have any default order. If you want data returned in a specific order, then you need to specify the columns or aggregates that you want to order the output by.
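Putting that logical order together in a single query (the table and column names are invented for illustration):

```sql
SELECT c.City, COUNT(*) AS CustomerCount   -- 5. project columns and alias them
FROM dbo.Customers AS c                    -- 1. determine the source of the rows
WHERE c.IsActive = 1                       -- 2. filter individual rows
GROUP BY c.City                            -- 3. group the remaining rows
HAVING COUNT(*) > 10                       -- 4. filter the groups
ORDER BY CustomerCount DESC;               -- 6. order the output (the alias works here)
```

Note that the alias CustomerCount can be used in the ORDER BY, but not in the WHERE, precisely because of where SELECT sits in the logical order.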
It's really important that you start to build this picture in your head of the order in which queries are logically processed.