Data Modeling: a tale of a man with only one name

I've lost count of the number of computer systems that I've worked with over the years, that insist that everyone has at least two names. Yet there are many people in the community who only have a single name. I'm not just talking about people like Cher who might well have a stage name but then have another name on her passport. I'm talking about people who really only have one name.

Can your systems deal with that?

And if they only have one name, which column does it go into?

Need to be PC

In Australia, we have standards for this sort of thing. While many existing systems will use words like Surname, LastName, FirstName, ChristianName, etc. they aren't considered PC (politically correct or appropriate) now either. In our standards, the terms are:

Family Name
Given Names

Only the family name is mandatory. At Meteor (Metadata Online Registry), these are also defined as 40 characters long each. And only the family name is mandatory. That standard says:

Some people do not have a family name and a given name, they have only one name by which they are known. If the person has only one name, record it in the 'Family name' field and leave the 'Given names' field blank.

A Tale of a Disruptive Man with One Name

At a place that I previously worked, we had a guy apply for a job, and he only had one name. He'd legally changed his name to just a single word. He wanted nothing to do with his family any more and removed all reference to them. He was a pretty alternate sort of character but I so wanted him to come and work with us. He would have broken every system in the place.

But in an interesting twist of fate, he couldn't come to his job interview. The airline told us they were not prepared to allow him to fly in what he was wearing.

To this day, I can't imagine what you could wear to an interview but an airline wouldn't let you fly in it.

T-SQL 101: #17 Paginating rows returned from SQL Server T-SQL queries

When I need to display a potentially large number of rows on a screen, I often need to show them in pages. This is called paginating the rows.

For example, if I'm showing bank transactions, I might want to show 25 per page. The challenge is that I also need to be able to ask for a specific page number. So my query becomes: give me page 7 where there are 25 rows per page. In that case, I want to skip the first 150 rows, then ask for the next 25 rows.

Now I could get SQL Server to send me all the rows at once, and I could just choose which rows to display. If there are many thousands though, that could be really wasteful, mostly in transferring all that data from SQL Server to my application. And if I want to only show a single page at a time, I really don't want to get all the rows every time and then just throw most of them away.

While we could do this with T-SQL in all versions of SQL Server, it was messy. I was really pleased when SQL Server 2012 gave us an option to help with this. It came in the form of extra clauses to the ORDER BY clause.

Take a look at the following query:

It's quite elegant. It says to skip the first 150 rows, then return the next 25 rows, exactly as we wanted. To get any particular page, we just subtract one from the page, multiply by the number of rows per page, and use that as the offset:

OFFSET (@PageNumber – 1) * @RowsPerPage ROWS
FETCH NEXT @RowsPerPage ROWS ONLY

Note that this option isn't really a performance enhancement but that's a story for another day. I like the fact that it is declarative though. You are telling SQL Server what you want, not how to do it. That's almost always a good thing.

Learning T-SQL

It's worth your while becoming proficient in SQL. If you'd like to learn a lot about T-SQL in a hurry, our Writing T-SQL Queries for SQL Server course is online, on-demand, and low cost.

Fixing Locking and Blocking Issues in SQL Server – Part 4 – Deadlocks

This is part 4 in a series of posts:

  • Part 1 covered being sure there is a locking and blocking issue
  • Part 2 covered the impact of RCSI
  • Part 3 looked at the impacts of indexing on locking and blocking

Today, though, I want to look at the next potential issue, and that's deadlocks.

Deadlocks

I often hear people interchangeably using the words lock, block, and deadlock.

Locks are a standard feature of systems that are designed for concurrency. Nothing nasty to see here. When they become a problem is in two situations:

  • They are held too long and cause a blocking problem
  • Incompatible locks are held and cause deadlocks

The simplest example of a lock is when I have an item and you want it, but you must wait until I finish with it. That's just normal with sharing something.

If I finish using it quickly, you might barely even notice. But if I hang onto it for a very long time, you experience blocking. Apart from slowing you down (which might seem to be a problem to you), it's also not actually an issue. When I finish, you'll still get it.

As soon as we have more than one item though, things can get messy. If I have item A and I want item B, but you are holding item B because you also want item A, then we have a problem. That's a deadlock.

It also doesn't have to just be two people. I could have item A and want item B. Terry might have item B and want item C. Nga might have item C and want item D. Paula might have item D and want item A. A larger deadlock has occurred.

Resolving Deadlocks – the Old Way

In the past, I worked with systems that didn't resolve this automatically. What would typically happen is that two users would suddenly have their applications freeze. Then a little while later, another user needed something that one of the first two had, and her application would freeze too.

This would go on and on across the building. We used to call it creeping death syndrome.

Eventually, someone would get sick of waiting, and go to find an admin. The admin would work out which user caused the first problem and kill their process. Everyone would then come back to life.

Resolving Deadlocks – the SQL Server Way

In SQL Server, there was a conscious decision to not let this happen and to resolve it automatically. By the word "resolve" though, you need to read "kill one of the user's processes".

SQL Server has a process that looks for these situations and cleans them up. The only question, of course, was which process to kill.

In early versions, they used to kill whichever process had just put the last link the deadlock chain. That process would return an error explaining they had been chosen as a deadlock victim and roll back.

(Note: users don't take well to being told they're victims).

Over time though, this caused issues as long-running processes could collide with short-running processes and get rolled back. They could then have the same issue happen again and again.

So a decision was made to look at the amount of data that had been written to the transaction log by the processes, and kill the one that had done the least work. That worked way better.

That's how SQL Server currently handles deadlocks. In the next post, I'll talk about how you should plan to work around deadlocks.

 

SDU Tools: Migrate SQL Server Object Names to PostgreSQL Names

A while back I wanted to do some testing of Azure Database for PostgreSQL and I needed a database to work with. So I thought I'd try to migrate a basic WideWorldImporters database across to PostgreSQL.

I wrote scripts to do the main migration but then decided that I wanted to change the names of the columns.  While a large number of people use PascalCased names in SQL Server, it's really, really common to use snake_cased names in PostgreSQL. Object names in PostgreSQL are also limited to 63 characters. I needed to alter these names while generating the scripts and I didn't want to do that manually.

To make it easy, in our free SDU Tools for developers and DBAs, we added a function PGObjectName to help.

You can see it in action in the image above, and in the video here:

To become an SDU Insider and to get our free tools and eBooks, please just visit here:

http://sdutools.sqldownunder.com

 

Opinion: Some thoughts for today on professionalism

Who moved my cheese (diagram tool)?

I was part of an interesting email chain today. It started with a guy complaining that in SQL Server Management Studio 18x, the database diagram tool had been removed.

Now I was disappointed to see the tool gone. Mind you, I didn't ever think it was a great diagramming tool. I would have hoped they might have replaced it by a better one instead of just removing it.

Anyway, what caught my eye was his complaint that by removing it, Microsoft had stopped him doing his work effectively on "complex legacy databases". In particular, he told me how he would use the diagram tool to add and delete foreign key relationships in the databases that he worked on.

I have to admit to being a bit horrified by that. I cannot imagine almost ever wanting to do that, let alone routinely. 

If I want to add a relationship, at the very least script it and run the script. That would give me a record of what I did. Often I'll need to apply the same change to another copy of the database anyway. But even if there was only one database, if it had to be restored to before the change, what would I do? Make the change again and hope to do it the same way?

I mentioned that I'd really prefer this to be in a database project and in source control, and deployed from that.

Professionals vs Cowboys

And the guy mentioned that he always just made changes the simplest way, and moved on, mostly because his customers were disorganized and didn't ever have things like source control or places to keep scripts. It all sounded like pure cowboy stuff and left me thinking about a consultant's role in this.

Awesome image by Oleksii Hlembotskyi

Perhaps age will fix it?

I especially loved it when he assumed I was young and when I'd been around long enough (he thought another 20 years), I'd think the same way. I've actually been in this industry 42 years so far, 27 of them with SQL Server. I might just have to disagree with him on that. I'm far from sure I'll still be doing this in 20 years' time. I suspect that if I'm still around at all, I'll be doing something more relaxing.

Awesome image by Vlad Sargu

Simple Thought for the Day

Look, the message for today is simple: No matter how sloppy your customers are, you owe it to yourself to try to do professional quality work.

T-SQL 101: #16 Ordering the output of a T-SQL query

When you run a query against SQL Server, you might want the rows to be returned in a specific order. However, the first thing you should ask yourself is whether that's needed at all.

Sorting can be a very expensive operation if there is no suitable index that's already sorted the data for you. And the question is: should SQL Server be the one doing the sorting? In many cases, the client application that's reading the data might be a better place to do the sorting. That's not always true but it's worth considering. You often have way more clients than your single database server, and sorting data unnecessarily in the database server could limit your ability to scale your system to higher volumes.

Let's let SQL Server do it

However, let's assume that you do want SQL Server to sort the data. An important first concept is that there's no default order for data when you query SQL Server. I regularly see people who think that if a table has a primary key, or has a clustering key (long story we'll talk about another time), that the data will come back in that order.

It might, but you can't guarantee it. Don't assume that if it does in simple tests, that it will continue to do so in production volumes and situations.

If you want rows to come back in a specific order, you need to say which order you want.

And we do that with the ORDER BY clause. Look at these queries:

The simplest example would be to just order the rows by a single column. In the first query though, I've ordered by two columns. That means that the rows are sorted by Description first, and if the Description values are the same, they are then sorted by ProductID, within each Description.

Ordering can also be ascending (ASC) or descending (DESC). Ascending is the default and we usually don't put ASC in the queries, but there's nothing to stop you doing that if you prefer it.

When you have multiple columns in the ordering, each column can be sorted in a different order. In the second query, I've made the Description column sort descending, but then within each Description, the ProductID would be sorted ascending.

Learning T-SQL

It's worth your while becoming proficient in SQL. If you'd like to learn a lot about T-SQL in a hurry, our Writing T-SQL Queries for SQL Server course is online, on-demand, and low cost.

Fixing Locking and Blocking Issues in SQL Server – Part 3 – Indexing

This is part 3 in a series of posts:

  • Part 1 covered being sure there is a locking and blocking issue
  • Part 2 covered the impact of RCSI

Today, though, I want to look at the next potential issue, and that's indexing.

Indexing impacts on locking

It might seem obvious that you need to fix indexing issues in your database applications but it mightn't be so obvious that locking is one of the main things that is affected.

There are two ways that poor indexing impacts locking:

  • Query duration
  • Number of columns (and data volume) involved in the locks

One of the key points that I made in the previous post was that the longer queries run for, the longer they hold locks.

Before you start investigating any detailed issues around locking, you must ensure that the queries run as quickly as possible. 

So often, as soon as the queries run fast, the locking and blocking issues disappear.

But another more subtle issue surrounding indexes is that they can avoid the need to lock entire rows of data when reading. Take the following query plan running against AdventureWorks as an example:

This query just needs StateProvinceID from the address table. There is an index on the StateProvinceID column so SQL Server reads from that index rather than from the table. No big surprise there but note that only a small amount of data will be read and locked.

Now, look at this query plan:

In this case, only the City column was needed but note the index that SQL Server chose to read. It is an index on AddressLine1, AddressLine2, City, StateProvinceID, and PostalCode. Notice that City was not the first component of the index. So why did SQL Server choose to use this index?

Because it was the smallest object it could read that contained the information, even though it had to read the whole index.

Whenever it can, SQL Server tries to minimize the amount of data it's reading, so once again, the index here will reduce the amount of locking that's going on. And less locking means less potential for locking and blocking issues.

Next time I'll talk about the issue that seems to get the most attention: deadlocks.

Learning about indexing

If indexing isn't your "thing" (yet), now would be a great time to take our online on-demand course on SQL Server Indexing for Developers. (It will help DBAs too).

SDU Tools: Retrust Foreign Keys and Check Constraints in SQL Server

While doing my consulting work, I often come across Foreign Keys and Check Constraints that are not trusted. What this means is that either:

  • They were created without checking existing data
  • They were disabled for a period and re-enabled without re-checking the existing data

I like to see this fixed whenever possible. To make that easy, in our free SDU Tools for developers and DBAs, we added two stored procedures to help: RetrustForeignKeys and RetrustCheckConstraints.

Both procedures take three parameters:

  • @DatabaseName – the database to work in
  • @SchemasToList – a comma-delimited list of the schemas to include or the word ALL (which is the default)
  • @TablesToList – a comma-delimited list of the tables to include or the word ALL (which is the default)

You can see these two procedures in action in this video:

To become an SDU Insider and to get our free tools and eBooks, please just visit here:

http://sdutools.sqldownunder.com

 

T-SQL 101: #15 Using column and table aliases in SQL Server queries

Look carefully at the following queries:

I've shown several examples here of what are called aliases. There are two basic types of aliases.

Column Alias

In the first query, I've used a column alias. I wanted to query the Description column in the dbo.Products table but I wanted the column called ProductName instead. So I follow the column with the word AS and then the name I want to use for the alias.

The word AS is actually optional but I really like using it to make it clear what's going on. I've seen many instances where someone just leaves a comma out in a list of columns, and one column name becomes an alias for the previous column name.

The second query is an example of where you really do need to use aliases. I have selected an expression 2 + 3 and without the alias, there would be no name for the column that was returned. If you're just reading it on the screen, that might not matter, but generally, I'm writing queries that are sent by programs and I need a column name on the data that comes back. It's also useful even if all you're doing is copying the results into a program like Excel.

Now there are two ways that these aliases can be written. Instead of writing:

2 + 3 AS CalculatedValue

I could have written:

CalculatedValue = 2  + 3

I don't love using aliases that way but I have some pretty clever friends who wouldn't write queries any other way. They write the column list with column names one under the other, and they like the way that this form would put all the aliases directly under each other.

It's also possible for you to write this:

'CalculatedValue' = 2 + 3

But don't do that. Same for using double quotes around the name. If you have to quote the name for some reason such as it's a reserved word, or it contains spaces, write it this way:

2 + 3 AS [Calculated Value]

or

[Calculated Value] = 2 + 3

Table Alias

In the third query, I've shown an example of a table alias. Tables can be aliased just like columns can. I tend to do this now, pretty much all the time.

The basic argument for doing this is that if you have more than one table in the query, it can be pretty confusing to work out which column in a SELECT list is from which table. You could always put the table name in front of each column but that's long and painful. Instead, just alias the table, and use the table's alias in front of the column names.

OK, so multiple tables makes sense, but why would I do that for a query with just one table? Well, the first bonus is that Intellisense works great when you do that. If I type p then hit a period, I get a list of columns for just that table. And chances are that sometime soon, someone will change my query to add another table anyway. So I'd rather just have aliases right from the start.

Learning T-SQL

It's worth your while becoming proficient in SQL. If you'd like to learn a lot about T-SQL in a hurry, our Writing T-SQL Queries for SQL Server course is online, on-demand, and low cost.

Fixing locking and blocking issues in SQL Server – Part 2 – RCSI

This is part 2 in a series of posts:

  • Part 1 covered being sure there is a locking and blocking issue

In part 1, I talked about pseudo locking and blocking issues that come from having queries running too long. I said that before you start fixing any locking and blocking issues, that you first need to make sure your queries are running fast. Today I want to discuss RCSI.

Lock escalation issues

Some queries, however, just need to run for a long time.

Awesome image by JC Gellidon
Awesome image by JC Gellidon

Most of the time, I find this with reporting queries. They're just doing a lot of work.

The problem is that even in the default isolation mode (read committed), SQL Server takes shared locks. When it gets to 5000 of these locks, it tries to escalate to a table lock. (Strictly, it might be a partition lock but that's a topic for another day)

And that's where the fun begins. Many people have had the experience of working with large transaction tables; someone's running a report down the end of the table and someone else is trying to modify the beginning of the table and gets blocked. Suddenly, you can't even insert current data.

No the reaction that people had to this in SQL Server 2000 days, was to just put NOLOCK on the queries. But that's not a great solution. You are basically giving up consistency for performance. It can lead to scenarios that are incorrect, and hard to explain.

Ever since SQL Server 2005, so many of these issues would be better fixed by using snapshot isolation level. Changing that in a session means changing the application, so for many customers, that's not helpful.

One possible solution

Instead of changing the application to use snapshot isolation level, SQL Server provides a database option called read-committed snapshot isolation (RSCI) that changes any read-committed queries into read-committed-snapshot queries i.e. it automagically applies snapshot isolation to read-committed queries, but just for the duration of the queries.

I've lost count of how many sites I've been to where just applying that database option fixed an enormous number of blocking issues.

What it means is that when you go to read data, and someone else is modifying it in a transaction, you see the data as it was before the transaction started, instead of waiting. Generally, that's awesome.

So why isn't it already on?

If it's such a great option, you might be wondering why it's not already turned on. Basically, that's for backwards compatibility. Some applications (I've seen very few of these) depend upon the old behavior. I'd suggest that these would be pretty badly written applications.

When we start new database applications today, we turn on RCSI for the database before doing anything else.

The other reason that it's not a default value is that it is going to slightly increase the impact on tempdb. Generally, we've not found that to be an issue.

In Part 3 of this series of posts, we'll move on to trickier situations.