SQL: Think that varchar(10) means 10 characters ? If so, think again!

If you read almost any book on the SQL language, you’ll see definitions where:
varchar(n)
means a varying length character data type, and where n is the number of characters it can store.
SQL Server 2019 changes things
If that’s how you’ve seen it, SQL Server 2019 is going to change your understanding. The product team have pointed out that that n was really the number of bytes, and that “it never was the number of characters”.
I’d almost guarantee that if I look back at older Microsoft documents and courses, they’ll say that it was the number of characters.
Now, given the only characters we could store in a varchar were ASCII characters with one byte per character, it’s hardly surprising that everyone thinks of the n as the number of characters.
But in SQL Server 2019 and the introduction of UTF-8 based collations that can be stored in varchar, a single character can be one, two, three, or four bytes. Note that we’re talking about varchar here, and not nvarchar.
So if you have a varchar(20) column now, you have no idea how many characters can fit in it, unless you’re only using ASCII characters and not ever using UTF-8. (But keep in mind that UTF-8 is pretty popular right now).
SQL Server 2012 already changed nvarchar
Most people also thought that nvarchar(10) meant 10 Unicode characters could be stored. Again the product team pointed out that the 10 meant 10 byte-pairs, not 10 double-byte characters.
SQL Server 2012 introduced SC (Supplementary Character) collations and this meant that a single character could be 2 bytes or 4 bytes when you’re using nvarchar. So once again, ever since then, you don’t know how many characters can actually fit in an nvarchar(10) data element.
Now I have to say that “byte-pairs” seems a pretty pointless measurement system to me. When a single Unicode character was a byte-pair in size, fair enough, but now…???
Other Databases?
As an example, if you look at the documentation page for strings in PostgreSQL (they’ve been natively UTF-8 based for a long time), they say that both char(n) and varchar(n) can store up to n characters in length.
And that’s how I wish it still worked in SQL Server as well. But it doesn’t.
So what to do?
In the next post, I’ll look at how I think I’m going to handle string data in SQL Server from now on.
2019-07-25