There are so many useful things in SQL Server Management Studio (SSMS) and daily, I come across users who aren't aware of common useful functions.
A request that I hear all the time, is "I don't like tabs but <insert name of annoying colleague here> decided he likes to use them. How do I remove them?" Now SSMS allows you to choose to use spaces instead of tabs (in Tools > Options) but that doesn't fix existing tabs.
The regular expression functions in SSMS let you do this. Hit Ctrl-H to open the Find and Replace dialog (or use the Edit > Find and Replace > Quick Replace menus), then configure the window as follows:
Note that I've clicked the little square and asterisk thing (which says to use regular expressions), then put the change from text to \t which is a tab, and set the change to text to four spaces (or however many you want), then I click the Replace All button and magically all my tabs are replaced.
But this is just the start of what we can do.
For example, if I have a long row of comma-delimited values like this:
Now I could go to each comma and hit enter after each one. That's what I see most people doing. But you could do this:
I'm changing a comma to a comma followed by a carriage return. And magically, this is the outcome:
There are so many things you can use this regular expression functionality for within SSMS. I encourage you to try it out.
I might be somewhat anal when it comes to database design. (Probably almost any type of coding). Perhaps it's a mild form of OCPD-behavior, but inconsistency frustrates me. (I'm told that OCPD is the correct term, and that people often apply OCD to that behavior incorrectly).
Worse, inconsistency leads to nasty, hard-to-find errors and your applications become more and more fragile.
If I'm holding an email address and I define it as 70 characters in one place, 100 in another, and 160 in yet another, I have the chance of random failures when I'm moving data around or assigning values from place to place. (For email addresses, I recently discussed why they should be 320 characters anyway).
Same for an address, or phone number, but even more insidious if I use different precision in decimal values ie: 2 decimal places here, 4 over there, 3 in other places. Then, random rounding issues can occur.
No matter how hard organizations try, database designs routinely end up inconsistent.
How do you find it? We have a tool to help !
One of the SDU Tools is used to find all columns that have the same name, but are defined differently in different parts of the same database. You might get some false positives, but if you haven't run something like this on a database before, you might be surprised by what turns up.
The main parameter is the database name, but you can optionally limit it to one or more schemas (you can provide a comma-delimited list), one or more tables, and/or one or more columns.
Here's an example run against the WideWorldImporters sample database that we created for Microsoft:
You can see it in action here:
You can find out more about our free SDU Tools here:
Many years back, I was invited to chair a course accreditation panel for a local TAFE (Technical and Further Education) course. They had started to offer a computing-related 3 year diploma, and the hope was that it wasn’t too far below the 3 year degrees offered at local universities. One part of that accreditation process involved me discussing the course with the staff members who were teaching it.
After talking to almost all the staff, what struck me was how similar they all were. In the requirements for the course, there was a standard that each staff member needed to meet, but there was also a requirement for the group of staff to be diverse enough to have broad knowledge of the industry. There was no individual staff member that you could identify as not being at the appropriate standard, but almost all of them had exactly the same background, career progression, etc.
The manager had basically hired clones of himself. It’s an easy mistake to make. If you feel you are the right person for a particular type of job, then hiring more people like yourself must help correct?
A similar problem happens in areas like medical research. Taking a whole bunch of people with the same background and experience isn’t going to let you cut through tricky problems that need someone to think outside the box. Adding someone like a civil engineer into the mix might seem odd but can have surprising outcomes. At the very least, they might ask a question that leads someone else in the team to think differently.
I’m remembering this story because I see the same issue in application development groups.
I’ve done some work at a company that has over 400 developers. Data is almost all that they do, yet for most of the time the company has existed; they’ve had no-one focused on data. Everyone involved in development has a similar development background. They had many intractable data-related problems yet more and more of the same type of people wasn’t going to solve those.
Hiring a team of people who think and work like you do might seem like a good idea but it’s not. You need a mixture of people if you want to be really effective. (And that also means having gender and cultural diversity too).
When designing databases, one question that comes up all the time is how large columns should be.
For numbers, the answer is always "big enough but not too big". This week I've been working at a site where the client numbers were stored in int columns. Given the clients are Australians and the Australian Bureau of Statistics Population Clock says there are just under 25 million of us, an integer seems a pretty safe bet, given it can store positive numbers up over two billion. It's hard to imagine that number being exceeded, but I've seen people deciding that it needs to be a bigint. I doubt that. Even if we count all our furry friends, we aren't going to get to that requirement.
I was recently at a site where they were changing all their bigint columns to uniqueidentifier columns (ie: GUID columns) because they were worried about running out of bigint values. In a word, that's ridiculous. While it's easy to say "64 bit integer", I can assure you that understanding the size of one is out of our abilities. In 1992, I saw an article that said if you cleared the register of a 64 bit computer and put it in a loop just incrementing it (adding one), on the fastest machine available that day, you'd hit the top value in 350 years. Now machines are much faster now than back then, but that's a crazy big number.
For dates, again you need to consider some time into the future. It's likely that smalldatetime just isn't going to cut it. Most retirement fund and insurance companies are already working with dates past the end of its range. What you do need to consider is the precision of the time if you're storing time values as well.
The real challenge comes with strings. I've seen developer groups that just say "make them all varchar(max)" (or nvarchar(max) if they are after multi-byte strings). Let's just say that's not a great idea.
But if they aren't all going to be max data types, what size should they be? One approach is to investigate the existing data. If you haven't used it, SQL Server Integration Services has a Data Profiling Task that's actually pretty nice at showing you what the data looks like. If you haven't tried it, it's worth a look. It can show you lots of characteristics of your data.
One thing that I see people miss all the time though, are standard data items. I was at a site yesterday where sometimes email addresses were 70 characters, sometimes 100 characters, other times 1000 characters, and all in the same database. This is a mess and means that when data is copied from one place in the database to another, there might be a truncation issue or failure.
Clearly you could make all the email addresses 1000 characters but is that sensible? Prior to SQL Server 2016, that made them too big to be in an index. I'm guessing you might want to index the email addresses.
So what is the correct size for an email address? The correct answer is to use standards when they exist.
In addition to restrictions on syntax, there is a length limit on email addresses. That limit is a maximum of 64 characters (octets) in the "local part" (before the "@") and a maximum of 255 characters (octets) in the domain part (after the "@") for a total length of 320 characters.
So my #1 recommendation is that if there is a standard, use it.
Everyone working with databases knows that having too many indexes can be a problem. Indexes need to be modified whenever the data in the associated tables need to be modified, so yes it can slow down data inserts, updates, and deletes.
How big a problem is too many indexes?
Overall, I'm not as concerned about this as many other people seem to be. I always want to focus on what the system is spending its time doing, and on most systems that I work on, that's reading, not writing.
Any time I ask someone what percentage of the time does their system spend writing, they always tell me numbers like 10% or 20% or 30%. I even recently had discussions with a SQL product group member who told me that he thought 40% to 60% was common. I just don't believe any of these numbers, or at least not for OLTP systems. Every time I measure it, the outcome is closer to 1%, or perhaps 2%. So while I acknowledge that there are systems where that 1% of the effort is time-critical, I usually get much better outcomes from optimizing the 99% of the workload, not the 1%.
And if you optimize the 99%, no surprise that the 1% benefits as well.
Finding data to modify is important too
It's also important to keep in mind that to modify data, you need to find it first. Yes, I understand that if all you ever do is to access rows one at a time by their primary key (a big shout-out hello to all the ORM fans), that's not an issue. But if you have an address table and want to correct the name of a city in every row, you'll sure be glad you had an index to let you find those rows quickly. Even most systems that modify data heavily spend a bunch of time reading, not writing.
So, while we're acknowledging that the indexing issue often isn't as bad as made out, it's pointless having indexes that aren't being used. They are just unnecessary overhead.
I'll talk about many of these in other blog posts but there are two categories of index that I need to call out specifically.
Single column nonclustered indexes – these are so often pointless. Unless they end up being super-selective in your queries (and I'm talking way less than even 1% selectivity), you'll find that SQL Server will quickly end up ignoring them, as the cost of the associated lookups is too high. That's a topic for another day.
Subset indexes – these are the ones I want to highlight today. If I have an index on Columns A, B, and C and another index on Columns A and B, it's very, very likely that you could live without the subset index on Columns A and B only. Clearly other things like included columns need to be considered but as a general rule, these are indexes that come into the "code smell" territory.
So I have a tool for that. One useful proc in the SDU Tools collection is ListSubsetIndexes.
It takes a single parameter, either the name of a database, or a comma-delimited list of databases to check. If you omit the parameter or provide the value ALL, then all user databases on the server are checked.
You can see it in action here:
You can find out more about our free SDU Tools here:
I often see people using SQL Server Management Studio (SSMS) and doing very repetitive editing tasks that could easily be carried out by using the selection and changing of rectangular regions of code.
The simplest example of doing this is to insert a bit of text on a number of rows. Take the following code as an example:
I've got the skeleton of a list of columns in a CREATE TABLE statement but let's assume that I'm a "comma in front" person and want to put a few spaces and a comma, etc. in front of each column after the second. I could just put the cursor on the InvoiceID line and type what I wanted, then do the same again on the next line. I could do it on the first line, then select and copy it, and insert it into the front of every other line. But what I should do is put the cursor in front of InvoiceID, and while holding Alt-shift, use the down arrow to select the beginning of every line, then just type what I want.
Similarly, if I want to change text on a bunch of lines, I can do the same by using both down arrow and side arrow to select a rectangular region, then just type to replace. In this example, I have a table name that I want to change from InvoiceLines to OrderLines:
So I select the starting point, Alt-shift and arrows to select the rectangle, then just type:
If you weren't aware of this, I hope you find it useful.
This post covers another pair of useful functions that are part of the SDU Tools collection.
The first tool AsciiOnly deals with situations where you need to convert Unicode data (multi-byte data) to ASCII data (single byte). It processes each character and if it's not in the ASCII range, it replaces the character with the value that you supply. If your second parameter is an empty string, you'll effectively strip out the non-ASCII characters, but your replacement can also be one or more characters.
Finally, it gives you an option to strip out any control characters, so this same tool could be used to strip control characters (like carriage returns, linefeeds, or tabs) from a string.
Here's are some examples:
You can see AsciiOnly in action here:
The second tool is very similar but is used to strip everything from a string that isn't a digit (ie: a character from 0 to 9). Apart from cleaning up numbers, this could be useful as part of cleaning up phone numbers.
It has an option to keep or remove a leading sign character (ie: a plus or a minus sign). Again while this is useful for numbers, it could be used to keep a plus sign (international dialling prefix) at the start of a phone number.
Here are some examples:
You can see DigitsOnly in action here:
You can find out more about our free SDU Tools here:
I work with data. I understand that for most organizations, that the data they own is the most valuable asset the company owns.
One thing I've learned from working with data is that unlike application software, data generally outlives generations of applications, is often used by many different applications, and typically just morphs from shape to shape over time. It almost never gets totally refreshed.
This is a good thing.
I've been in the industry long enough to see many types of consultants. One type that I have the least time for, is the type that always wants to tear down or replace whatever is already there at an organization. It's far easier to just say "let's replace it all" than to try to work out what to do.
Many of these consultants don't really understand what's already in place, but because it looks different to what they are used to, it must be wrong, and it must be replaced.
A mature consultant might not like what they see but they take the time to consider what's already there.
The mature consultant is skilled enough to work out how to take the organization from a place that they don't want to be, to a place which is better than where they are now.
Memory is a precious resource on most systems today. SQL Server loves memory, so it's important to avoid wasting it.
One common cause of wasted memory is what we call plan cache pollution. The plan cache is the region where SQL Server stores compiled query plans (and query plan stubs).
There are some nice new DMVs that give you details of what's in that cache but nothing still compares to one of the older compatibility views sys.syscacheobjects. You can use it to see what's in your plan cache:
You'll see the different types of objects that are contained there but the one that's of interest to us today is the Compiled Plan.
In the image above, I don't know without looking further if the first three queries shown are actually the same query but they nicely demonstrate one of the issues. Because these plans are looked up via hashes of the queries, it's really important that every time you call T-SQL code, that you use consistent calling. Note above that DB_ID() is capitalized in the first row, and now in the next two. Also notice that there is an extra space after db_id() in the third row. All it takes to get different query plans is for anything to be different. Whitespace, capitalization, etc. all matter, even on case-insensitive servers.
What causes the biggest issue with query plans though is how parameters are defined. One big issue with many frameworks is that they don't define parameters correctly. This was a particular problem with Linq to SQL where under the covers it called the AddWithValue() method on the SqlCommand.Parameters collection. The problem with that method is it did not define the data type, and Linq to SQL had to derive the data type. What it did was to change the data type based on the length of strings. So "hello" was an nvarchar(5), "freddie" was an nvarchar(7) and so on.
What this meant is that your SQL Server plan cache could quickly become full of query plans for every combination of every length of parameter that was ever passed to it. In some later variations of this and other frameworks, they just now assign all strings to big buckets like nvarchar(4000) to try to avoid this problem.
It's one of the reasons that I'm not a fan of ORMs. When you find a problem with something like this, what do you do?
Well what Microsoft did, after they actually caused this problem, was to add an interesting option to SQL Server.
EXEC sp_configure 'optimize for adhoc workloads', 1; GO
Once this configuration option is enabled, SQL Server doesn't store a query plan the first time it encounters a particular query. It just stores a small stub so it can remember that it has seen the query before. Then the second time it sees a query, it stores the plan.
I believe this option really should be on for most systems. Yes, it slightly increases the amount of compilation that could occur but it avoids a truly polluted plan cache.