GUIDs (uniqueidentifiers in SQL Server) are well understood to cause fragmentation issues when used as clustering keys for SQL Server tables. It's easy to understand why, but is NEWSEQUENTIALID the answer? I don't think so. Here's why.
When a SQL Server table has a clustered index, it needs to maintain the logical sequence of rows in the order of the clustering key. There are a few important notes here:
- I've mentioned key, not column. The key might involve more than one column.
- I'm not talking about primary keys. While the default for SQL Server is to make primary keys clustered if there isn't already another clustering key, primary keys are a logical construct unrelated to how the table data is stored.
- I'm not talking about physical storage. I often hear people say that clustered indexes determine how data is physically stored. That was in old Microsoft courseware, etc. and just isn't true. It's about logical ordering. If not, there would be no fragmentation and changing the data would take forever.
- The problem with using GUIDs as clustering keys is that they are typically generated in a random order using a function like NEWID() or using a function like System.Guid.NewGuid() in .NET (or similar functions in other languages).
Let's take a look at this by creating a table and populating it:
I've used NEWID() to populate the ImportantFactID column, and because we've declared that column as a primary key, and no other index is clustered, it will also be the clustering key. So the table will be clustered on ImportantFactID.
Now let's look at the values that have been inserted:
Clearly the table data is being inserted in quite a random order. And this is what leads to the fragmentation.
So, it would seem sensible that we'd want to insert data in order. Mostly we do want that. There are some very high-end scenarios where the random inserts are better, but for most people, they are a problem.
Because this had become a common problem, SQL Server 2005 introduced NEWSEQUENTIALID(). It produces values that look like GUIDs but always increase in value.
Notice that you can directly create a value with NEWID():
But you can't do that with NEWSEQUENTIALID():
NEWSEQUENTIALID() can basically only be used as a column default. Fair enough, but what do the values look like?
Let's recreate the table and find out:
And let's look at the data that was inserted:
The first thing that should strike you is that the values are now nothing like random. They are increasing but note that they are also predictable. That mightn't concern you but I work at many sites where some reliance is placed on people not being able to guess GUID values. Clearly this isn't the case with NEWSEQUENTIALID().
If you also wonder if you are now more likely to have a collision on generated values, the answer is "yes".
But is the concept of them being ever increasing the answer to our problem?
I'd argue that it isn't. You need to consider why people use GUID values in the first place. 99% of the time, it's so that any layer of code can generate an ID without having to refer to a central ID controller. I can write an application where I generate an order ID, and attach order details to that order ID, and then throw the lot into the database without concern about colliding with something else, and without needing to make a round-trip to the database to first get a new ID.
As soon as I'm using NEWSEQUENTIALID(), at least as soon as I'm using the T-SQL function, I need to go to the database to get an ID as it's only available as a column default. If I'm going to the database to get an ID, what advantage does a pseudo-GUID give me over just getting a bigint? You might as well just get a bigint.
For me, the T-SQL NEWSEQUENTIALID() function completely misses the point of why people use GUID values in the first place and it seems pointless to me.
Automatically increasing GUID values might have a place, but if they do, it's only the ones that are generated in upper layers of code, not those generated in the database.
My verdict is that NEWSEQUENTIALID() is in the category of "nice try, seemed a good idea at the time, but completely missed the target".
Interesting post.
The documentation of uniqueidentifier says:
"Merge replication and transactional replication with updating subscriptions use uniqueidentifier columns to guarantee that rows are uniquely identified across multiple copies of the table."
So for instance in the use case of syncing data from your mobile(s) to a central database, the newsequentialid() still has its use over using a bigint.
Would you happen to know if there is a technical reason why SQL Server does not allow to do a select NEWSEQUENTIALID()?
No idea on why you can't do a SELECT on NEWSEQUENTIALID(). My guess is that because it's only a small range and keeps increasing, that burning them as part of SELECT operations probably isn't a good idea. But that's just speculation.
Merge replication is renowned for being slow. It's just not the best way to do that any more. There are so many other options but merge replication is the easy built-in one. I'm often at sites where merge replication has been used in volume, and they end up with merge storms. You also need to spend time performance tuning the sp_MSSetupBelongs stored procedure. (Name off the top of my head)
And worth noting that even though updating subscribers to transactional replication is mentioned, it's now deprecated and shouldn't be used.
I think you completely misunderstood the use case for SEQUENTIAL guids. The idea is that you want a universally unique ID but you also don't want the fragmentation of the indices that a "normal" guid creates. SEQUENTIAL guid's solves both these problems. Generating SEQUENTIAL guid's from code is also possible by calling UuidCreateSequential in rpcrt4.dll in Windows.
Not at all. I totally get why they thought it would be of value but I think it misses the mark. UuidCreateSequential (or any sort of COMB GUID) is ok to call from application code, even with all its failings. But that's done at the app layer, and is of value because it avoids the need to go to a central "number-issuer" like a database first.
NEWSEQUENTIALID however, can only be used as a database column default. By itself, it's far more likely than a GUID to have a collision, and it has known potential issues with duplicates after reboots.
More importantly though, if you are going to go to the database to get a value, what is the point of getting one that's not even a good GUID? What is the advantage of that over a bigint? You'd be much better picking up a value from a bigint sequence. And if cross-machine collisions are a concern, you're also going to be better off with a partitioned sequence across those same machines.
It is not far more likely to have a collision, I don't know where you get that from. Neither is there any risk of duplicates after reboot. The guid is guaranteed to be globally unique just like a normal guid. Just because you can guess the next id by looking at an existing does not mean that a guid generator algorithm would be any more likely to generate a collision to that guid than any other guid.
Also I don't see a problem with returning the generated ID when doing an insert if you need it back.
As for comparing with big int that is a completely different use case, if I don't require any global uniqueness I will of course go with int.
Any time you constrain the possible next values, you are, by definition, increasing the chance of a collision. After a reboot, these values can start again with a lower value. There is no guarantee of global uniqueness. That's described on the documentation page for the function. Uniqueness of these is based on MAC addresses for the system being unique. That cannot be relied upon. Some systems don't even have MAC addresses but there are also numerous examples of duplicate MAC addresses in both physical and virtual network adapters. Normally that's only a problem if you strike two of them in the same subnet but if you are using them for global uniqueness, that's not valid.
But regardless, a bigint is going to be far more efficient to work with than any type of uniqueidentifier value, so why use it? If the value was generated in the application tier, that totally makes sense. But if the way it's generated is to go to the database to get a value, what is the point of returning a value that's awkward to work with, and less efficient to work with? You could just as easily number each of your servers, and give each say a non-overlapping range of a thousand billion bigints. Your system would be easier to work with and faster.
Avoiding round trips by doing this in the application tier though, is a completely different story. But NEWSEQUENTIALID can't be used that way.
I know this article is a couple of years old, but I've just been looking at a problem of fragmentation caused by a newid() default, and whether NEWSEQUENTIALID ( ) would help.
You said above that "There is no guarantee of global uniqueness", but according to Microsoft (https://docs.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-ver15), NEWSEQUENTIALID "Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique."
Why do you say that there is no guarantee of global uniqueness?
Hi Simon, yes, it says you won't get a smaller GUID than any other one generated on that machine. But that's not global. What if there are multiple machines involved? But the main issue is why on earth, if you're going to the DB to get a value, that you'd want a GUID in the first place? That's a fundamental design problem. GUID values make sense when they're generated by apps outside the DB. The app can then put the value into the DB confident that it won't collide. But if the DB is generating the value in the first place, there's zero benefit to it being a GUID or uniqueidentifier data type. Just return an int or bigint instead, preferably using a sequence or an identity. (We prefer sequences today). And then you won't run into the fragmentation issue that you're seeing with NEWID().
"what advantage does a pseudo-GUID give me over just getting a bigint? "
— It doesn't. That's not the point. When you have an existing table in production, that uses Guid's and causes 99.9% fragmentation, you can use newsequentialid as a bandaid to alleviate the fragmention problem.
An Int/BigInt will always have far superior performance to a guid. There is also another option, for existing prod tables with guid cluster id… have both columns, a clustered int id and a guid as a non-clustered index. You get your performance, and you can keep the guid.
Agreed that it could be used as a bandaid for poorly designed systems that are suffering issues. But I see people using it in new designs. That is not sensible. A big part of the problem is that people also assume that their physical data model needs to match their logical data model. If you're going to have to have GUIDs (and usually the rationale for that is poor), the physical design could be a single table that maps GUID values to bigints. Then build the whole DB with bigints, and only one table gets messed up constantly and needs maintenance.
For all those preaching about the fragmentation horrors of Random GUIDs or the wonders of NEWSEQUENTIALID(), you're in for big surprise. To whet your whistle, I have inserted 100,000 rows into a Random GUID Keyed Clustered Index for 58 days (that's 5.8 MILLION Rows, folks!) for 58 straight days with less than 1% fragmentation with no index maintenance until the very last day when it finally went a bit over 1% fragmentation. In one of the most heterodoxical twists of knowledge that there's even been in the world of SQL Server, I've actually used Random GUID indexes to PREVENT FRAGMENTATION!
In nearly as large a twist of knowledge, the real reason why Random GUIDs seem to have a fragmentation problem is because nearly the whole world has adopted supposed "Best Practices" for Index Maintenance that aren't "Best Practices", were never meant to be "Best Practices", and actually are a very "WORST PRACTICE" for most indexes that fragment save for a very few types of patterns.
Please see the following presentation for the proof of all that.
https://www.youtube.com/watch?v=qfQtY17bPQ4
Is it not worth creating a memory guidcreator table with a clustered pk int identity 1,1 and a default NEWSEQUENTIALID () guid and linking to that? on every insert you insert into that first, get your int back and use that in your other table pks, and fks. if you need the guid its a quick find. you can also easily lookup the int.
If you are going to the DB to get a value though, why get a GUID at all? And particularly one that just points to an int. Just get an int in the first place. What value would getting a GUID (and particularly a pretend GUID) add?