SQL: Think that varchar(10) means 10 characters? If so, think again…

If you read almost any book on the SQL language, you'll see definitions where:

varchar(n)

means a varying length character data type, and where n is the number of characters it can store.
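
Under that reading, here's a quick sketch of what everyone expects (just a throwaway example):

```sql
-- Ten ASCII characters fit exactly in a varchar(10)
DECLARE @s VARCHAR(10) = 'abcdefghij';
SELECT LEN(@s) AS [Len], DATALENGTH(@s) AS [DataLength];
-- Returns: 10, 10
```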

SQL Server 2019 changes things

If that's how you've seen it, SQL Server 2019 is going to change your understanding. The product team have pointed out that the n was really the number of bytes, and that "it never was the number of characters".

I'd almost guarantee that if I look back at older Microsoft documents and courses, they'll say that it was the number of characters.

Now, given that the only characters we could store in a varchar were ASCII characters, with one byte per character, it's hardly surprising that everyone thinks of the n as the number of characters.

But with SQL Server 2019 and the introduction of UTF-8 based collations that can be used with varchar, a single character can be one, two, three, or four bytes. Note that we're talking about varchar here, not nvarchar.

So if you have a varchar(20) column now, you have no idea how many characters will fit in it, unless you're only ever using ASCII characters and never using UTF-8. (But keep in mind that UTF-8 is pretty popular right now.)
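
Here's a quick sketch of what I mean, assuming SQL Server 2019 or later, and using one of the new UTF-8 collations (Latin1_General_100_CI_AS_SC_UTF8 in this example):

```sql
-- With a UTF-8 collation, LEN counts characters but DATALENGTH counts bytes
DECLARE @Utf8 TABLE ([Val] VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8);
INSERT INTO @Utf8 ([Val]) VALUES (N'héllo'), (N'😀😀😀😀😀');
SELECT [Val], LEN([Val]) AS [Len], DATALENGTH([Val]) AS [DataLength] FROM @Utf8;
-- héllo       5    6   (the é takes two bytes)
-- 😀😀😀😀😀   5   20   (each emoji takes four bytes)
```

That varchar(20) column is completely full after just five characters.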

SQL Server 2012 already changed nvarchar

Most people also thought that nvarchar(10) meant 10 Unicode characters could be stored. Again, the product team pointed out that the 10 meant 10 byte-pairs, not 10 double-byte characters.

SQL Server 2012 introduced SC (Supplementary Character) collations, and this meant that a single character could be 2 bytes or 4 bytes when you're using nvarchar. So once again, ever since then, you don't know how many characters can actually fit in an nvarchar(10) data element.
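
A small sketch shows the issue (an emoji is a supplementary character):

```sql
-- One emoji: a single character that takes two byte-pairs (4 bytes) in nvarchar
DECLARE @Smiley NVARCHAR(10) = N'😀';
SELECT LEN(@Smiley) AS [Len], DATALENGTH(@Smiley) AS [DataLength];
-- DATALENGTH is 4. LEN is 1 under an _SC collation, 2 otherwise.
```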

Now I have to say that "byte-pairs" seems a pretty pointless measurement system to me. When a single Unicode character was a byte-pair in size, fair enough, but now…???

Other Databases?

As an example, if you look at the documentation page for character types in PostgreSQL (which has been natively UTF-8 based for a long time), it says that both char(n) and varchar(n) can store strings up to n characters in length.
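
For instance, in a UTF-8 encoded PostgreSQL database, a sketch like this works, because the n really does count characters:

```sql
-- PostgreSQL: 'héllo' is five characters but six bytes, and varchar(5) accepts it
SELECT 'héllo'::varchar(5) AS val,
       char_length('héllo') AS chars,   -- 5
       octet_length('héllo') AS bytes;  -- 6
```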

And that's how I wish it still worked in SQL Server as well. But it doesn't.

So what to do?

In the next post, I'll look at how I think I'm going to handle string data in SQL Server from now on.


9 thoughts on “SQL: Think that varchar(10) means 10 characters? If so, think again…”

  1. Hi Greg,
    Good catch… From https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-2017:

    In SQL Server, the default maximum precision of numeric and decimal data types is 38. In earlier versions of SQL Server, the default maximum is 28.

    Length for a numeric data type is the number of bytes that are used to store the number. Length for a character string or Unicode data type is the number of characters. The length for binary, varbinary, and image data types is the number of bytes. For example, an int data type can hold 10 digits, is stored in 4 bytes, and doesn't accept decimal points. The int data type has a precision of 10, a length of 4, and a scale of 0.

    When concatenating two char, varchar, binary, or varbinary expressions, the length of the resulting expression is the sum of the lengths of the two source expressions, up to 8,000 characters.

    When concatenating two nchar or nvarchar expressions, the length of the resulting expression is the sum of the lengths of the two source expressions, up to 4,000 characters.

    1. Great catch for you too. Yes, that page is now clearly wrong, and it's just the start of all the ones that will need to be fixed.

  2. Hi Greg. I am not wanting to be negative, but there are several technical inaccuracies here:

    1) "where n is the number of characters it can store.": If this is how most books phrase it (and I'm not saying you're incorrect as I haven't looked through that many SQL Server books), then they're all wrong _unless_ the book is describing SQL Server 7.0 or older.

    2) "given the only characters we could store in a varchar were ASCII characters with one byte per character": This has not been true for a long time. SQL Server 2000 introduced the Windows collations, some of which are Double-Byte Character Sets (DBCS) for Chinese, Japanese, and Korean. Those code pages use either 1 or 2 bytes per character, in the VARCHAR, CHAR, and the deprecated TEXT datatypes. (There's a short demonstration of this after the list.)

    3) "if you have a varchar(20) column now, you have no idea how many characters can fit in it, unless you're only using ASCII characters": How are you defining "ASCII"? Do you mean Standard ASCII (i.e. the first 128 values/code points, 0 – 127), or Extended ASCII (i.e. all 256 values, 0 – 255)? If you mean Standard ASCII, then sure. But when including the "extended" characters (all of which are still 1 byte in non-DBCS code pages), then this statement is no longer accurate. Several code pages (including Hebrew and Arabic) have combining characters that are used in the formation of a single displayable character. For example, the following shows a Hebrew letter, Shin, that could either be a "Sin" (an "s" sound) if there is a dot on the upper-left, or a "Shin" (an "sh" sound) if there is a dot on the upper-right of the letter. There are 3 code points being used: the base "Shin" letter (0xF9), the "Sin dot" combining character (0xC9), and then a Latin upper-case "H" (0x48), yet it's only two "characters":

    ```sql
    DECLARE @CP1255 TABLE ([Sin] VARCHAR(10) COLLATE Hebrew_100_CI_AS);
    INSERT INTO @CP1255 ([Sin]) VALUES (0xF9C948);
    SELECT [Sin], LEN([Sin]) AS [Len], DATALENGTH([Sin]) AS [DataLength] FROM @CP1255;
    -- שֹH 3 3
    ```

    4) "keep in mind that UTF-8 is pretty popular right now":

    True, UTF-8 is the most common encoding used on the Web (and even some OSs), but that doesn't imply that it should be used. UTF-8 is wonderful in some cases, and absolutely horrible in others.

    5) "nvarchar(10) … the product team pointed out that the 10 meant 10 byte-pairs, not 10 double-byte characters.": I am guessing that you are misquoting them here, or misunderstanding what they said, because that statement does not make sense. Except for the surrogate code points, a "byte-pair" is a "double-byte character". I believe they are trying to simplify the language here, but I don't think it's a good idea. What they mean is that each BMP character (the first 65,536 characters) is encoded in UTF-16 as a single 2-byte "code unit". Supplementary characters (i.e. all remaining code points of the 1,114,111 possible) are encoded in UTF-16 as 2 code units (i.e. a surrogate pair). So in NVARCHAR(10), you get up to 10 code units (i.e. 20 bytes). Whereas in UTF-8 the code units are 1 byte each, so VARCHAR(10) means you get up to 10 code units (i.e. 10 bytes). (See the second demonstration after the list.)

    6) "SQL Server 2012 introduced SC (Supplementary Character) collations and this meant that a single character could be 2 bytes or 4 bytes when you're using nvarchar.": While correct that SQL Server 2012 introduced the "_SC" collations, it is incorrect that it had any effect on what could be stored in NVARCHAR datatypes (columns or variables). I don't have either SQL Server 7.0 or 2000 to test with, but you should be able to store supplementary characters without a problem in both versions. The "_SC" collations merely allow the built-in functions to properly deal with supplementary characters as individual code points. The official documentation is incorrect about these collations in a few places, and when I have time I will submit corrections.

    7) "if you look at the documentation page for strings in PostgreSQL": What you linked to is not the official documentation. The actual documentation ( http://www.postgresql.org/docs/12/datatype-character.html ) states:

    > The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less.

    That sounds interesting, but is quite a different approach than how SQL Server handles storage of strings. And, I would guess that they really mean "code point", which still leaves you with the potential for arguing with a developer or end user about whether "character" means a single code point or a single displayable "grapheme" (to put in Unicode terms).
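
    To illustrate points 2 and 5 above, here are two small sketches (the values are just examples, and Chinese_PRC_CI_AS is just one of the DBCS collations):

    ```sql
    -- Point 2: Chinese_PRC_CI_AS uses code page 936, a DBCS code page, so these
    -- two characters take four bytes even in VARCHAR:
    DECLARE @DBCS TABLE ([Val] VARCHAR(10) COLLATE Chinese_PRC_CI_AS);
    INSERT INTO @DBCS ([Val]) VALUES (N'中文');
    SELECT [Val], LEN([Val]) AS [Len], DATALENGTH([Val]) AS [DataLength] FROM @DBCS;
    -- 中文  2  4

    -- Point 5: NVARCHAR(10) means 10 UTF-16 code units, so five supplementary
    -- characters (two code units each) fill it exactly:
    DECLARE @Emoji NVARCHAR(10) = N'😀😀😀😀😀';
    SELECT DATALENGTH(@Emoji) AS [DataLength]; -- 20 bytes = 10 two-byte code units
    ```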

    For more info on how characters are stored in SQL Server, please see:
    "How Many Bytes Per Character in SQL Server: a Completely Complete Guide" ( https://sqlquantumleap.com/2019/11/22/how-many-bytes-per-character-in-sql-server-a-completely-complete-guide/ )

    Take care,
    Solomon…

    1. Hi Solomon,

      Thanks for the detailed rationale.

      But regardless of the internal storage, and regardless of how it currently works, the bigger issue is how it should work, and what would be useful to a developer using it. For me, apart from some pretty weird scenarios, unless datatype(n) means that n characters fit in that datatype (regardless of encoding), you might as well just have a generic string datatype.

      Regards,

      Greg
