SQL: Think that varchar(10) means 10 characters? If so, think again…

If you read almost any book on the SQL language, you'll see definitions where:

varchar(n)

means a varying-length character data type, where n is the number of characters it can store.

SQL Server 2019 changes things

If that's how you've seen it, SQL Server 2019 is going to change your understanding. The product team have pointed out that the n was really the number of bytes, and that "it never was the number of characters".

I'd almost guarantee that if I look back at older Microsoft documents and courses, they'll say that it was the number of characters.

Now, given the only characters we could store in a varchar were ASCII characters with one byte per character, it's hardly surprising that everyone thinks of the n as the number of characters.

But with SQL Server 2019 and the introduction of UTF-8 based collations that can be stored in varchar, a single character can be one, two, three, or four bytes. Note that we're talking about varchar here, and not nvarchar.

So if you have a varchar(20) column now, you have no idea how many characters will fit in it, unless you're only ever storing ASCII characters and never using UTF-8. (And keep in mind that UTF-8 is pretty popular right now.)
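You don't need SQL Server handy to see the arithmetic. Here's a quick Python sketch (the sample characters are just ones I've picked for illustration) showing how many bytes UTF-8 needs per character, and why a byte limit of 20 doesn't pin down a character count:

```python
# varchar(n) with a UTF-8 collation limits *bytes*, not characters.
# UTF-8 uses 1 to 4 bytes per character, depending on the character.
samples = {
    "A": 1,   # plain ASCII letter: 1 byte
    "é": 2,   # accented Latin letter: 2 bytes
    "€": 3,   # euro sign: 3 bytes
    "😀": 4,  # emoji (supplementary character): 4 bytes
}
for ch, expected_bytes in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{ch!r} takes {len(encoded)} byte(s) in UTF-8")
    assert len(encoded) == expected_bytes

# Six euro signs already use 18 of varchar(20)'s 20 bytes,
# so a seventh three-byte character wouldn't fit.
print(len(("€" * 6).encode("utf-8")))  # → 18
```

Depending on which characters you store, that hypothetical varchar(20) could hold anywhere from 5 characters (all four-byte ones) up to 20 (all ASCII).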

SQL Server 2012 already changed nvarchar

Most people also thought that nvarchar(10) meant 10 Unicode characters could be stored. Again the product team pointed out that the 10 meant 10 byte-pairs, not 10 double-byte characters.

SQL Server 2012 introduced SC (Supplementary Character) collations and this meant that a single character could be 2 bytes or 4 bytes when you're using nvarchar. So once again, ever since then, you don't know how many characters can actually fit in an nvarchar(10) data element.
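The same arithmetic applies to nvarchar, just in UTF-16. A small Python sketch (the byte_pairs helper is a name I've made up for illustration) shows why a supplementary character uses two of nvarchar's byte-pairs rather than one:

```python
# nvarchar stores UTF-16, and the n in nvarchar(n) counts byte-pairs
# (2-byte units), not characters.
def byte_pairs(s: str) -> int:
    """Number of 2-byte UTF-16 units the string occupies."""
    return len(s.encode("utf-16-le")) // 2

print(byte_pairs("hello"))   # → 5: five ordinary characters, five byte-pairs
print(byte_pairs("😀"))      # → 2: ONE supplementary character, two byte-pairs
print(byte_pairs("😀" * 5))  # → 10: five emoji would fill nvarchar(10)
```

So with an SC collation, nvarchar(10) holds ten ordinary (Basic Multilingual Plane) characters, but only five supplementary ones.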

Now I have to say that "byte-pairs" seems a pretty pointless measurement system to me. When a single Unicode character was a byte-pair in size, fair enough, but now…???

Other Databases?

As an example, if you look at the documentation page for strings in PostgreSQL (they've been natively UTF-8 based for a long time), they say:

Both char(n) and varchar(n) can store up to n characters in length. If you try to store a longer string in the column that is either char(n) or varchar(n), PostgreSQL will issue an error.

And that's how I wish it still worked in SQL Server as well. But it doesn't.

So what to do?

In the next post, I'll look at how I think I'm going to handle string data in SQL Server from now on.


3 thoughts on "SQL: Think that varchar(10) means 10 characters? If so, think again…"

  1. Hi Greg,
    Good catch… From https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-2017:

    In SQL Server, the default maximum precision of numeric and decimal data types is 38. In earlier versions of SQL Server, the default maximum is 28.

    Length for a numeric data type is the number of bytes that are used to store the number. Length for a character string or Unicode data type is the number of characters. The length for binary, varbinary, and image data types is the number of bytes. For example, an int data type can hold 10 digits, is stored in 4 bytes, and doesn't accept decimal points. The int data type has a precision of 10, a length of 4, and a scale of 0.

    When concatenating two char, varchar, binary, or varbinary expressions, the length of the resulting expression is the sum of the lengths of the two source expressions, up to 8,000 characters.

    When concatenating two nchar or nvarchar expressions, the length of the resulting expression is the sum of the lengths of the two source expressions, up to 4,000 characters.

    1. Great catch for you too. Yes that page is now clearly wrong, and it's just the start of all the ones that will need to be fixed.
