There is an endless discussion in the development community about whether databases should include declared foreign key constraints or not.
As a consultant, I regularly visit a wide variety of client sites. Many of these have applications designed without constraints. When I ask why there are no constraints, the response is invariably one of the following:
- The app ensures that it's right
- They don't work well with our application development
- They are too slow
- What's a foreign key constraint?
Most of the sites I work with have sizeable databases, and it's in those situations that this discussion matters most.
The app ensures that it's right
In most enterprises, large databases end up being used by more than a single application, and each application is often made up of many sub-applications or modules. That means every place that accesses the data needs to apply the same rules. (The same issue applies to other forms of constraints as well.)
Worse, the databases often end up being accessed by applications from different authors, frequently built on different technology stacks.
ETL processes are often used to move data into the databases or to update that data.
Even worse, in real production scenarios, data-fixes often get applied directly to the database.
A view of the world that says everything will be OK because all access goes through a single application is a very narrow one.
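To make that concrete, here's a minimal T-SQL sketch (the table and constraint names are hypothetical) of why a declared foreign key sidesteps the whole problem: the rule lives with the data, so every application, ETL process, and ad-hoc data fix is covered automatically.

```sql
-- Hypothetical schema: the rule is enforced by the database itself,
-- not by any one application.
CREATE TABLE dbo.Customers
(
    CustomerID int NOT NULL CONSTRAINT PK_Customers PRIMARY KEY
);

CREATE TABLE dbo.Invoices
(
    InvoiceID  int NOT NULL CONSTRAINT PK_Invoices PRIMARY KEY,
    CustomerID int NOT NULL
        CONSTRAINT FK_Invoices_Customers
        REFERENCES dbo.Customers (CustomerID)
);

-- Fails with error 547 no matter which application, module, or
-- data-fix script issues it:
INSERT dbo.Invoices (InvoiceID, CustomerID) VALUES (1, 999);
```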
They don't work well with our application development
What this really boils down to, in most cases, is that the developers didn't want to work out the correct order for their updates; it's simply easier for them without constraints, because they can update any table they want, in any order.
That means the data passes through many interim states that are actually invalid, and is only (hopefully) eventually consistent. What happens with concurrent access at that point? What happens to a reporting application that finds invoices for customers that don't exist?
It would be helpful if SQL Server supported deferred constraint checking, but today it doesn't. I've been formally asking for it for over ten years: https://connect.microsoft.com/SQLServer/feedback/details/124728/option-to-defer-foreign-key-constraint-checking-until-transaction-commit
I still think it's one of the most important enhancements that could be made to SQL Server from a development point of view.
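For readers who haven't seen deferred checking, here's a sketch of the SQL-92 syntax. Note that this does not run on SQL Server; it works in databases that implement deferrable constraints, such as PostgreSQL or Oracle, and the table names are hypothetical.

```sql
-- SQL-92 deferrable constraint: the check is postponed until COMMIT,
-- so within the transaction the rows can be saved in any order.
ALTER TABLE Invoices
    ADD CONSTRAINT FK_Invoices_Customers
    FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID)
    DEFERRABLE INITIALLY DEFERRED;

BEGIN;
INSERT INTO Invoices (InvoiceID, CustomerID) VALUES (1, 42); -- parent missing
INSERT INTO Customers (CustomerID) VALUES (42);              -- parent arrives
COMMIT; -- the constraint is checked here, and the transaction succeeds
```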
They are too slow
So often I'm told "we can't do it because it would be too slow". Yet almost every time I ask if they've actually tested it, I'm told "no".
The reality is usually that someone's brother's cousin's friend read it somewhere on the Internet, so they decided it would be a problem.
Whenever I really test it, I find very little impact, as long as appropriate indexing is in place and sensible options are chosen when bulk importing data.
I do occasionally find specific constraints that I decide to disable, but they are few and far between. Even then, I like to see the constraint remain in place (so it can be discovered by tools, etc.) but just not checked. It certainly never applies to all the constraints in a database.
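For anyone who hasn't used it, disabling and re-enabling looks like this in T-SQL (the constraint name is hypothetical). The doubled CHECK on re-enabling isn't a typo: WITH CHECK tells SQL Server to re-validate the existing rows, so the constraint becomes trusted again.

```sql
-- Disable the constraint: it stays in the catalog (so tools can still
-- discover it) but is not checked on insert/update.
ALTER TABLE dbo.Invoices NOCHECK CONSTRAINT FK_Invoices_Customers;

-- ... run the bulk load or other proven-problem workload ...

-- Re-enable it, re-validating existing rows so the optimizer can
-- trust the constraint again.
ALTER TABLE dbo.Invoices WITH CHECK CHECK CONSTRAINT FK_Invoices_Customers;
```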
What's a foreign key constraint?
Sadly, this is also a common question, though fortunately mostly on smaller databases and applications. Some developers genuinely don't understand the issue, but they are not the majority.
Bottom Line
When I do a detailed check on a system that has run without constraints for quite a while, I almost always find data integrity issues.
When I show the integrity issues to the customer, they look surprised, then say something like "ah, that's right, we had that bug a few weeks back". Pretty much every time, there are issues somewhere.
That's ok if you're building a toy application; not ok if you're building a large financial application.
At the very least, sites that decide not to have constraints should have a process in place that periodically checks data integrity. Most sites that don't have constraints don't do this either.
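In SQL Server, DBCC CHECKCONSTRAINTS does much of that work, and a targeted orphan check is easy to write by hand (the table names below are hypothetical):

```sql
-- Report any rows that violate constraints, including disabled ones.
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;

-- Or check a specific relationship for orphans directly:
SELECT i.InvoiceID, i.CustomerID
FROM dbo.Invoices AS i
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Customers AS c
                  WHERE c.CustomerID = i.CustomerID);
```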
It's important to find issues as soon as they occur. Finding issues well after they have been created always makes them much harder to fix. At least with constraints, you get instant feedback that something's wrong. If you have a bug in your ETL process that is messing up your data in some subtle way, you don't want to find that out several weeks later after other processing has occurred on that data.
Summary
I like to see constraints in place until it's proven that a particular constraint can't be in place for a performance reason. Then, I like to see it (and only it) disabled rather than removed. A bonus is that it can usually still be discovered by tools.
This is part of the long-running debate (feud) regarding whether business rules should be included in the database. Constraints are business rules, essentially.
The debate goes away (or should) when one differentiates between data-oriented business rules and interaction-oriented business rules. Data-oriented business rules should be enforced as closely to the data as possible (datatypes, constraints, triggers, sprocs and views, in that general order) and interaction-oriented business rules should be enforced as close to the interface as possible.
In the end, data types are constraints too, so where should the line be drawn? I like your distinction between data-oriented and interaction-oriented business rules. Given that most architectures have at least three layers, though, I'm not sure I'd call the others interaction-oriented. Perhaps interaction- and application-logic-oriented?
Interestingly, most of the existing BIG ERPs do NOT run with referential constraints.
SAP does not:
"However, many Enterprise Database Applications (like SAP) have their own proprietary metadata repository (such as the SAP Data Dictionary, or DDIC) that allows for self-management of referential integrity relationships, cluster tables, etc. The self-management means it does not rely on the RDBMS to provide the Declarative Referential Integrity (DRI) to manage the Primary Key and Foreign Key relationships. Almost all SAP database tables have Primary Key constraints specified but without any foreign keys that are associated to them. This is significant in that there is no way for the database to assess the logical consistency between tables and their data when those relationships aren't defined. That can only be done within the realm of the SAP Application module itself."
from here:
https://blogs.msdn.microsoft.com/saponsqlserver/2013/10/30/corruption-handling-in-sap-databases/
MS Dynamics AX also does NOT have referential constraints in the database – 2500+ tables and no referential checks. And many others are the same. I worked for a mid-market ERP company; we had a 1200+ table solution – with no referential constraints.
Wherever you work, most likely your paycheck was not computed and stored with referential constraints… That really SUXXXX… These systems handle financial data and run big businesses…
(continuing)
It's hard to advocate for referential constraints in smaller apps when such examples are well known.
The cause, I think, is that these big systems are built using special proprietary languages (ABAP for SAP, X++ for Dynamics AX, etc.), and these languages just did not give much support to a 'delayed save-all' that would allow the data access layer to properly sort the updates. Every record is saved individually and immediately from code; the main reason for this is the reliance on identities as PKs, so the record ID is needed immediately, to start linking other entities to it. Using sequences (instead of identities) would have helped, but sequences appeared in SQL Server only recently.
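To illustrate the commenter's point about sequences: since SQL Server 2012, a key value can be fetched before any row is inserted, which removes the need to save each record immediately just to learn its identity value. (The sequence and variable names here are hypothetical.)

```sql
-- A sequence lets code obtain the key up front, so related rows can be
-- assembled in memory and saved in a dependency-respecting order.
CREATE SEQUENCE dbo.InvoiceSeq AS int START WITH 1;

DECLARE @NewInvoiceID int = NEXT VALUE FOR dbo.InvoiceSeq;
-- @NewInvoiceID can now be used to link child rows before the parent
-- invoice row has been inserted.
```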
So it all rolled out over the years into the mess we have now. And the result is, of course, a real mess in the data: orphan records, pointers pointing to nowhere, etc. This is in real ERP apps that run big multi-nationals!
One quite easy fix could be the delayed referential constraint check that you mention – then all this stuff would start working (or most of it).
But MS stupidly does not understand the importance of this SQL-92 feature. Sad…
The split is data integrity vs. business rules; referential integrity being an example of the former. An Invoice without a Customer is incoherent (cannot be usefully interpreted or accounted for). "This status is only valid for this customer type under these conditions" is a business rule, and does not impair the ability to interpret and use the data even if violated.
Data integrity is the responsibility of the data people. If you do not have data people – well, there's your problem!
Many people don't use them because of performance issues…
They create the unique index on the column to build the FK, but forget to build the "inverse" index…
Imagine a Taxes table… Usually the table has few rows… very few (10, 20, 100 tops…). The invoice line table has a taxId and lots and lots of rows (let's say 10,000,000 for a medium database…). The FK only requires a unique index on the taxId of the Taxes table… It doesn't require an index on the invoice line table…
If you delete a tax from the Taxes table, the database will scan the entire invoice line table to check whether it's used… Our ERP had this situation…
Whenever you tried to delete a tax, it took more than 10 minutes… We created the "reverse" index and it was immediate…
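Using the commenter's (hypothetical) table and column names, the fix is a single nonclustered index on the referencing column, so the delete-time check becomes a seek instead of a scan of every invoice line:

```sql
-- Without this index, deleting from Taxes scans the whole invoice line
-- table to verify the tax isn't referenced; with it, the check is a seek.
CREATE NONCLUSTERED INDEX IX_InvoiceLines_taxId
    ON dbo.InvoiceLines (taxId);
```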
Hi Jason, for that reason, when we review system designs, if we find a declared foreign key where the key columns aren't the left-most component of at least one non-clustered index, we consider that a code smell.
It's why I've previously blogged that Microsoft should create default indexes on foreign key constraints unless you use the "I know what I'm doing" option.
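As a rough sketch of that review check, the following query finds foreign keys whose leading column isn't the left-most key of any index on the referencing table. It's deliberately simplified (it only examines the first column of each key); a production version would handle multi-column keys more carefully.

```sql
-- Foreign keys whose first column is not the leading key column of any
-- index on the referencing table (simplified single-column check).
SELECT fk.name AS ForeignKeyName,
       OBJECT_NAME(fkc.parent_object_id) AS TableName,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS ColumnName
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc
  ON fkc.constraint_object_id = fk.object_id
 AND fkc.constraint_column_id = 1
WHERE NOT EXISTS
(
    SELECT 1
    FROM sys.index_columns AS ic
    WHERE ic.object_id = fkc.parent_object_id
      AND ic.column_id = fkc.parent_column_id
      AND ic.key_ordinal = 1
);
```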
I strongly agree with you, Greg. Many years ago I published a script to automatically generate indexes for FKs.
http://sqlblog.com/blogs/paul_nielsen/archive/2007/02/08/codegen-to-create-indexes-for-fks.aspx
I was surprised when another very well-known SQL MVP told me later that he thought it was a stupid idea to assume that FKs should have indexes. It's a very rare FK that doesn't benefit from an index.
I received this query via email:
I have a question. I've been a DBA for years. At a prior job we had performance issues with high-volume inserts; I found a few FKs were slowing things down through redundant checking and turned them off, because the code handled the quality, and things sped up. In my current job we have performance complaints, and basically every single table has FKs: one table had 200+ FKs on it, and the top 13 tables have over 500 FKs between them (an average of 42 per table). They now temporarily disable them for some processes, which sped things up, so clearly there is a cost to them. I've been pushing to have the models simplified, with keys on critical tables rather than on every table. I am fine with that model, but can there be too many FKs?
In general, I find very few FKs that slow performance in a notable enough way to make me decide to trade integrity for speed. Even then, I only disable (not delete) the ones that are *proven* to be a problem.
However, if I had 200 or more FKs on a single table, I'd suspect I had much bigger problems than the FKs. The model seems to be the real issue in your case.