I spend a lot of time reviewing database designs at customer sites. So often, I come across bizarre decisions that have been taken in the choice of data types within the databases. One problem that I see all the time is the way that email addresses have been stored.
One data element or more?
The first thing you need to decide is whether an email address is really one data element or more.
One school of thought (based on strict normalization) is that the local part (i.e. the part in front of the @ symbol) could be stored separately to the domain part (i.e. the part after the @ symbol). I could see arguments for that, or perhaps even for separating components of the domain part, in some sort of purist view, but I don't do that.
I think of an email address as a single data element.
Data type and length?
I then see variations on whether the email address is stored as varchar or nvarchar. And, I see variations on lengths.
I see people using 60 characters, 100 characters, 150 characters, or whatever they guessed might be big enough. Some even allow 1000 characters. Sadly, I often also see max data types used (i.e. varchar(max)). Even more sadly, I see the length defined differently in different columns within the database.
Just how long should you allow for email addresses?
There's a standard for that
The correct answer, as it is with many data modelling questions, is to see if there is a standard that defines what's needed. And of course, for email addresses, there is a standard.
RFC 5322 defines email addresses. You'll find it here.
Email addresses are made up of:
- local part (up to 64 characters)
- @ sign (1 character)
- domain part (up to 255 characters)
That means that, in theory, valid email addresses could have been up to 320 characters.
However, as Brian Lawton noted in the comments to this post, there is a further restriction on the overall total. RFC 2821 (here) restricts the total length to 254, as clarified in the errata published here. Interestingly, the errata noted that they had said 255 but then clarified it to 254.
And as for varchar vs nvarchar, the RFC had different allowed characters for each part of the email address but all the allowable characters for email addresses are all contained in the varchar range.
However, since 2012 and another RFC 6530 (here), we've had the ability to use non-ASCII values in both parts. Original email servers would not have allowed values like this:
But servers updated to handle RFC 6530, 6531, 6532 requirements do permit this.
So the appropriate format today, when using SQL Server, would be nvarchar(254).
Is that what you've allowed for in your database designs? I wish I'd applied the current standard in the design of WideWorldImporters samples for SQL Server 2016. I used nvarchar(256), so I was close.