I spend a lot of time reviewing database designs at customer sites. So often, I come across bizarre decisions that have been taken in the choice of data types within the databases. One problem that I see all the time is the way that email addresses have been stored.
One data element or more?
The first thing you need to decide is whether an email address is really one data element or more.
One school of thought (based on strict normalization) is that the local part (i.e. the part in front of the @ symbol) could be stored separately to the domain part (i.e. the part after the @ symbol). I could see arguments for that, or perhaps even for separating components of the domain part, in some sort of purist view, but I don't do that.
I think of an email address as a single data element.
Data type and length?
I then see variations on whether the email address is stored as varchar or nvarchar. And, I see variations on lengths.
I see people using 60 characters, 100 characters, 150 characters, or whatever they guessed might be big enough. Some even allow 1000 characters. Sadly, I often also see max data types used (i.e. varchar(max)). Even more sadly, I see the length defined differently in different columns within the database.
Just how long should you allow for email addresses?
There's a standard for that
The correct answer, as it is with many data modelling questions, is to see if there is a standard that defines what's needed. And of course, for email addresses, there is a standard.
RFC 5322 defines email addresses. You'll find it here.
Email addresses are made up of:
- local part (up to 64 characters)
- @ sign (1 character)
- domain part (up to 255 characters)
That means that, in theory, valid email addresses could have been up to 320 characters.
However, as Brian Lawton noted in the comments to this post, there is a further restriction on the overall total. RFC 2821 (here) restricts the total length to 254, as clarified in the errata published here. Interestingly, the errata noted that they had said 255 but then clarified it to 254.
And as for varchar vs nvarchar, the RFC had different allowed characters for each part of the email address but all the allowable characters for email addresses are all contained in the varchar range.
However, since 2012 and another RFC 6530 (here), we've had the ability to use non-ASCII values in both parts. Original email servers would not have allowed values like this:
But servers updated to handle RFC 6530, 6531, 6532 requirements do permit this.
So the appropriate format today, is to allow up to 254 characters.
PostgreSQL – the easy option
Now if you are using a database engine like PostgreSQL, you could just define the data type as varchar(254) as in PostgreSQL, the number in the brackets is the number of characters. For me, that's entirely sensible.
SQL Server – not so easy
However, for SQL Server, as Anup reminded me below, things are more complicated than they should be. The number in brackets for nvarchar is the number of byte-pairs, and for varchar it's the number of bytes. Personally I think this is a broken implementation and would prefer it worked the way PostgreSQL does.
For nvarchar, when using SC (supplementary character) collations, a single character can take 2 or 4 byte pairs.
That means you'd need to allow for 2 or 4 byte characters for everything except the @. That means the address could occupy 507 byte pairs. (That's 254 * 2 + 1). So you could specify it as nvarchar(507).
If you were sure that no SC collation was being used, it could be 254 byte-pairs, so:
For UTF-8 data in SQL Server, each character could be from one to four bytes long, so the value could be up to 1017 bytes (254 * 4 + 1) in length. That would make it varchar(1017) when using a UTF-8 collation.
But the overall length !
And even though, the values could occupy up to that many byte-pairs or bytes, you still might want to limit it to 254 characters in total. With SQL Server, you'd need to do that with a check constraint. But because the LEN function is also broken (it ignores trailing spaces), you can't just say this:
Email Address varchar(1017)
CHECK LEN(EmailAddress) <= 254
To really check the value, you need to do this instead:
Email Address varchar(1017)
CHECK LEN(EmailAddress + 'x') <= 255
You can see why I prefer the PostgreSQL approach here.