SQL: How do I choose a clustering key in SQL Server?

Clustered indexes vs heaps

OK so you know that a table can either be a heap or it can have a clustered index. The first question of course, is which should I use?

If you have any doubt at all, put a clustered index on the table. Most SQL server tables will end up performing better that way. There are a few scenarios like log files where heaps will be preferable but if you’re not sure, start with a clustered index.

But which column or columns should I cluster on?

That then raises the question of what sort of keys or which columns are really the best to choose for that clustering index, and of course, like many things in SQL server, the answer is that “it depends on how it’s being used”.

For example, if you’re inserting data, and you’re looking at insert performance in the table, the choices you make will affect this. For this operation, I’d start by saying that the value needs to be static, because if the value changes, then that means the row has to move. That’s almost always a bad thing. Because the rows are kept in a logical sequence, you do not want to use a value that ends up changing, at all.

The second thing is, you want a short value. This value is going to be all throughout the index. Every single thing in the table is going to be sorted by that, and it’s the value that’s going to be present in the leaf level of every nonclustered index as well. You don’t want long values here.

Large key values like long strings, will not perform well for this.

You also want the values to be unique. If SQL server doesn’t know that your values are unique, it will automatically add some additional data that we call a uniqueifier to make sure that it can uniquely identify, every single row. Your world will be better if it knows that they are actually unique.

Finally, it might be helpful if the values are increasing. It doesn’t have to be what’s called monotonically increasing (that is going up by the same amount each time), but just always getting bigger. What that means is that data will just keep getting added to the end of the table. Most of the time that’s going to be helpful.

There’s always an “it depends”

There are some very high performance scenarios where the end of the table could then start to become a hotspot and can be a problem because all the inserts are happening there. But for the vast majority of tables you’ll be better off having the data that gets added to the table, being added to the end of the table and an increasing clustering key will do that for you.

Learn about indexing

Want to learn more about indexing? Take our online on-demand course now: https://training.sqldownunder.com/p/sql-server-indexing-for-developers

2019-01-07