If you've been working with databases for any length of time, you will have come across implementations of Entity-Attribute-Value (EAV) data models (or non-models as some of my friends would call them).
Instead of storing details of an entity as a standard relational table, rows are stored for each attribute.
For example, let's create a table of people:
When we query it, all is as expected:
And for a long time, this has been how we expect to create tables.
Now, if we changed the previous code to be EAV based, we could do this:
And when we query it, we'd see this:
Clearly this is going to be much harder to work with for all operations apart from reading and writing a list of values by the entity's ID.
So the first question that arises is:
Why would anyone want to have this sort of (non) design in the first place?
There are a few reasons. Some ok, some not ok.
Reason 1: Sparse Data
EAV tables first appeared for situations where there was a large number of potential attributes (ie: columns here) and each entity (ie: row here) only had a few of them. You can see a description of this in Wikipedia here: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
Products like SharePoint had this issue, and that's what pushed the SQL Server team to vastly increase the number of columns per table in SQL Server 2008. Prior to that version, you could have 1024 columns per table, and most people would have considered that if you needed more than that, you were doing something wrong. SQL Server 2008 increased the number of columns per table to 30,000, and also added the concept of a SPARSE column. Unlike standard columns that would always have occupied at least one bit in a NULL bitmap, SPARSE columns only occupy space when they contain data.
To avoid issues with indexes on the sparse columns, we also got filtered indexes in that version. They were highly useful for many reasons, not just for sparse columns.
The addition of SPARSE columns allowed SQL Server to use relational tables to get around the reasons that were the basis for the requirement for EAV tables.
Reason 2: Flexibility?
If we assume that we're not working with a sparse design, what's next? I often hear this type of design referred to as "flexible". When I quiz developers on what they mean by that, it basically boils down to:
I don't have to change the database schema when I add or remove attributes, or even tables.
So they have basically created a table to rule them all.
This simply isn't a good reason. It might stem from either working in organizations where someone else controls the database and they're difficult to work with, or don't want to make changes quickly enough. But that's fixing the wrong problem. It can also stem from a basic lack of design in the first place.
Reason 3: Object Repository
We only want to use the database as an object repository. All logic occurs outside the database. We just want to serialize objects in and out of the database. If you have this reason, I love you. You keep me in work, dealing with organizations with intractable performance issues. Fine for a toy application, not fine for an enterprise application.
Worse, this type of design tends to lead to information silos where the data storage is specific to the one version of one application. That's not what enterprises need. They typically have many applications that use the same data, and those applications are built on more than one technology stack.
More to Come
Now there are other reasons and I'll discuss them in the next article on this topic, when we discuss the pros and cons of this type of design, but I wanted to start with the definitions and the basic "why" question.
If you have other "why" options, please let me know in the comments.
16 thoughts on “SQL: Design – Entity Attribute Value Tables (Part 1) – Why?”
I know an ISV that has an EAV table in their database because they let customers and 3rd party apps integrate tightly with it, including adding their own attributes to data. In addition to the columns for E, A, and V, there's basically also a column for AttributeSource. (Values can be the customer's name, the 3rd party integration app, etc.)
The single table design means any 3rd party app is vaguely compatible with other 3rd party apps, and that customers can manage data that comes from any 3rd party app in the same way.
I'm not necessarily fond of it, but I understand why they did it that way.
Yes, but I often wonder why they still couldn't just use columns. Why create your own database within the database?
I do often see it for the "user configurable" data. But even then, it shouldn't apply to most of the database, just for those parts. But I regularly see it for almost the whole DB.
I hear justifications that "the customers need to be able to add some columns" and somehow that translated to "we need to store all our columns that way".
A (former) SQL Server person whose initials are HB once told me that one of the target use cases for XML columns and indexes was just this type of usage. Want to add an attribute or two but can't get the schema changed? Here's a bucket.
Hi Mark yes can be slthough the cool kids would now use a JSON column. For me, they eork best for very sparse columns eg a use by date for food when 99% of your products aren’t food. But even for that might prefer a column. Where I do like JSON columns is sparse array data eg perhaps languages spoken by a staff member.
Awesome Work there Explanation is awesome and the way of describing all the reasons in amazing
This old piece is as good a summary as any
with the newer version of SQL (& Oracle) there are some ladders you can add to the board of very long snakes (I guess this is the focus of part 2 🙂
So you're saying Slack is a toy application?
They use EAV for workspace preferences
There's a world of difference between a chat app and a typical data-oriented business app. Even then, I doubt that all their storage is in a single EAV based table. It'd be unworkable. Where those tables do work fine is for things like persistence of property bags, which sounds like what you might be referring to in "workspace preferences". The issue is whether or not any processing needs to happen within the database. If the only operations are "read all the values", "write all the values", "read one value", "write one value", then that's fine. For anything more complex, then no.
It's 2019, but Slack still don't have a REST API.
So, even if they're a great company it doesn't mean their tech is a showcase of best practices.
Yes, I'm always puzzled when I see "It must be a good practice. Look at how they've built this
/ / with it." It's the same with things like Twitter. Good scale, but I've often had issues with saving tweets, where it "just can't do it right now" or "something went wrong". That's OK for a chat app (maybe), not OK for most business apps. Unless you're discussing similar types of apps, the justifications are all pretty meaningless.
Adding columns in, say for a Machine parameter whereby different machines have different parameters , am I supposed to go edit the database and add a column in the table for each machine that has a parameter to be added to the table – shall the operation of adding that machine into the database because operator can not enter the parameter – seems ridiculous to me. A better route – I think would be to not have a datatype but declare the column type on result set or insert and not during storage. Thinking about it I guess FoxPro is looking like a pretty good solution in this area .. and I thought it had nothing useful to offer… woohoo ..
I sure would like to see that implementation of Sparse Columns as an alternative to E-A-V ..
Post a sample please ..
There's never one rule for everything. I don't know the situation with your machines, but let me give you an example of relays that are used in electricity substations. Within an organization that controls the substations, there can be a need to deal with thousands of different types of relays i.e. relays each with different properties/attributes. But the answer isn't to have a super wide table with all possible properties, or to go the other way and just have an EAV table. Even though they have different properties, there will be a bunch of things that are common to all relays i.e. location, brand, purchase date, etc. etc. They should be standard relational columns. But when you get to something like the settings for a relay, it doesn't makes sense to try to flatten all those out. Plus, some settings have other nested settings. So today, most people would put those settings in a lump of JSON or XML as a column within a standard relational table. That means that you can do all the normal querying about where things are, how long they've been there, etc. as relational queries. You can retrieve settings if you need them. And you can still query them if you really need to, but mostly you do this for data that is mostly just retrieve and isn't queried in the database.
say you have to store user preferences such as frequency at which email notifications should be sent.
Further assume, you are using Sso so you don't have a user's table as such. And, you use email ids to uniquely identify a user.
For a situation such as this EAV makes perfect sense.
I'm using EF/.net core 3.1 as an API & react on the client side.
Having a EAV for user preferences makes perfect sense.
Why wouldn't a value like "frequency at which email notifications be sent" be a normal table column? Would you never want to query that or report on that from within the DB?
And even if you have SSO, why wouldn't you have a users table, using the email address as an ID, rather than just planting the email address as yet another attribute in an EAV table?
The real test always is: "will I be querying this value within the database?" If so, chances are very high that it should be in a normal table. However, a set of app preferences that are never queried within the DB and only ever read/written as a whole, might make sense for an EAV table.
I use EAV for creating forms. The fields for the forms can be of any data type. I tried many options but the EAV looks the most stable.
Hi William, sure, that makes sense. The difference is that you're not really processing individual values within the DB. I presume it's largely just saving and retrieving sets of values.