How to implement and benefit from relational integrity between database tables
How Relational Integrity can simplify your database project, shorten development time, and improve the quality of your life.
Originally published on evolt.org as RI to the Rescue!!.
Unless you are a seasoned database pro, "relational integrity" might be just another fancy marketing phrase that you don't really understand but are vaguely suspicious of. If the following explanation helps you, I will be very, very happy -- you don't even have to tell me; I'll know. And if you'd like more information, please contact me; I do database work for a living, answer email for free, and would love to arrange some onsite training or consulting at your company.
To illustrate relational integrity concepts, I use the following sample database:
articles such as stories, columns, opinions, recipes, quotes, ideas, or similar items of written content
authors who write articles
This database could be for an ezine (an electronic magazine web site), a famous quotes search engine, an online library, or any application which has to keep track of content (articles) by owner (author).
If you already know what RI is, please feel free to skip over the following basic explanations.
What is a Relational Integrity?
A relational database contains tables of data which are related to each other. For example, articles are related to the author who wrote them, and conversely, authors are related to the articles they wrote. (Sounds obvious, and it is.) That's the relational part.
When we work with a relational database, we naturally expect that data in related tables stay related. For example, articles written by MartinB should always be related to MartinB and never be confused with articles written by any other author. That's the integrity part.
Relational Integrity is also called Referential Integrity, perhaps because the mechanism for ensuring integrity is implemented by the way the tables reference each other. Both terms are okay with me. I usually just say RI anyway.
How RI Works
Tables are related using data values. Usually, these are the values of numeric keys instead of text fields. (Database designers love assigning numeric keys, mainly because it's more efficient than relating tables through text fields, but also because most databases can generate autonumbers or sequence numbers for this purpose.)
So let's say MartinB is an author in the database and has author number 32. Further, let's say that for some reason (which we won't go into here, as it's not important) you need to change MartinB's author number from 32 to 45678. The way it works is that you change his author number on his author record in the author table. And then you had better also change any references to author 32 in all related tables -- like the articles table -- because if you don't, they'll still refer to author number 32, which no longer exists.
Mercifully, you don't have to do this yourself. This is where RI comes in. RI can do it for you.
The underlying structure and enforcement of RI is usually handled by the relational database system. I say usually because in all database systems I've ever worked with, RI is optional, absurd as that seems. Database developers have to turn RI on, or else ask the developers to provide external enforcement procedures outside the database system.
So if you don't have RI turned on for your particular database, then when the day comes -- and you know it will -- when you have to make a change and you forget the external procedure(s) to change all the necessary related data values, the database will be said to have lost integrity. You will have "broken links" in your tables. You will be up the creek. (Next April 1, mention to a DBA that there are broken links in his database, and watch his face...)
You will, I am certain, by this point have guessed that RI can be a very useful feature to employ in your database.
Let us take one more brief moment to cover some basic definitions before getting down to business. Again, please feel free to skip over these definitions.
Parent and Child
Every database relation is a one-to-many relation involving just two tables, a parent table and a child table. Yes, you may hear other terms like one-to-one, many-to-many, and so on, but these are all just special cases of the one-to-many relation. Don't let them fool you. Understanding databases is easy if you keep it simple.
So each relation is one-to-many, involving a parent table and a child table. This terminology is not standard, merely convenient, if somewhat unfortunate. Some people say master and detail; sometimes, I'll say owner and member, a flashback from pre-relational CODASYL databases as out of date as bell-bottom pants (except that unlike pre-relational databases, bell-bottoms may one day be back in style).
As you might guess, the parent is the "one" and the child is the "many." Each parent can have many children. The reverse, I am afraid, is not true, not in a relational database. A child can have only one parent. Please don't dwell too long on the obvious fact that in real life a child has two parents -- you will only get confused. In a relational database, in a one-to-many relation, each child has at most one parent. This is important. This is also why the parent-child terminology is unfortunate.
Let's look at our sample database. Each author is related one or more articles, the articles written by that author. The author table is the parent table, and the article table is the child table.
Note that an author can write many articles, but an article has only one author. This is a natural consequence of it being a one-to-many relation. The scenario where an article is co-authored by more than one author involves a many-to-many relation, the contemplation of which, like the reality that in real life a child has more than one parent, is beyond the scope of this discussion; it will only confuse you now.
Primary and Foreign Keys
Relational databases implement RI using primary keys and foreign keys. Other than their special role in RI, these "keys" are like any other database fields -- they contain data values.
The parent table contains the primary key and the child table contains the foreign key which references (or "points" to, if you prefer) the parent table.
Because each child relates to only one parent, the child table needs only one foreign key. So where are the "many" child records in a one-to-many relation? They're in the multiple rows of the child table that all have the same data value in their foreign key, all referencing the same parent.
For example, consider the following articles:
|32||Your clients need a CMS|
|The Tao of Testing|
|88||640 x 480 Isn't Dead Yet|
|Inside the evolt.org Rebuild|
Above, you can see the one-to-many relation clearly. Now look at it the way you normally see it actually stored in the article table:
|Your clients need a CMS||32|
|640 x 480 Isn't Dead Yet||88|
|The Tao of Testing||32|
|Inside the evolt.org Rebuild||88|
Now you can see more clearly that each article is related to only one author. This is the "many-to-one" reversal of the one-to-many relationship. It amounts to the same thing. You can always say something about a one-to-many relation in two different ways.
Finally, we get down to business.
In a relational database, there are four basic actions you can perform on a table:
Select one or more existing rows from the table
Insert one or more new rows into the table
Update one or more rows in the table
Delete one or more rows from the table
These actions, known as SIUD, are carried out by the -- wait for it -- Select, Insert, Update, and Delete SQL statements. (These actions are sometimes, as in business process modelling, also called CRUD, using the initials for Create, Reference, Update, and Delete. CRUD is easier to remember, but not as helpful as SIUD, because you may end up getting confused. For example, there's no Reference SQL statement, but there is a Create SQL statement, although it doesn't create any data. I'm sorry I even mentioned CRUD. I just wanted to set the record straight, point out again that simplicity rules, and make sure you were awake.)
Three of these actions -- Insert, Update, and Delete -- change the contents of the database table and therefore have RI implications. Remember, you want to stay on top of changes, preferably letting the database itself handle them, to ensure you never have "broken links" in your data.
Now it gets a wee bit complicated.
For every relation, the Insert, Update, and Delete actions can be performed on either the primary key (in the parent table), or the foreign key (in the child table). There are thus a total of six possible RI actions for every one-to-many relation:
|Parent Insert||Child Insert|
|Parent Update||Child Update|
|Parent Delete||Child Delete|
Each of these six possible RI actions needs to be examined when you design your database, because what you want to do is turn over as much of the enforcement of RI as possible to the database itself.
Now it gets a little more complicated.
For each of the six RI actions mentioned above, one of five RI rules can be implemented:
|Cascade||perform same action on related keys|
|Set Null||set related keys to null|
|Set Default||set related keys to their default value|
|None||allow action, do nothing to related keys|
Does this mean that implementing RI requires choosing one out of a possible thirty different combinations of actions for each one-to-many relation?? The answer is yes -- but I hasten to add that it isn't really as bad or as much work as it may seem!!
In fact, most database systems do not let you define all of the above rules. Where you cannot define the rule, it typically defaults either to Restrict or None.
Sample RI Action Rules
Let's say you want to insert a new article, i.e. a child insert action. In most databases, the Restrict rule would be in effect (assuming you have defined the foreign key properly, i.e. RI has been turned on). What this means is that you cannot insert an article for an author that doesn't exist; you are restricted from doing so.
This is typical behaviour for most one-to-many relations. You have to add the parent to the database first, and then you can add child records that reference that parent. In fact, if you wanted some other rule for Child Insert, you would find it extremely tricky to implement. Trust me, it's messy. Luckily, this requirement hardly ever comes up. I shouldn't even have mentioned it. Of the five possible actions for Child Insert, only Restrict really makes sense.
Okay, I lied. Actually, there is one other action that makes sense, in a macabre sort of way, only because you see it all the time, not that you would really want to define it this way on purpose (assuming you have been paying attention).
If you do not turn on RI for the author-article one-to-many relation, the Child Insert action defaults to None. In other words, you can add an article that references any author number you like, the database will not check it for you, and who knows, it might result in a broken link. So leaving RI undefined for a given relation results in Child Insert: None.
Maintaining the author-article relation involves six different actions. Below, I have chosen the RI rules that "make the most sense" to me. See if you agree or disagree with my comments:
Author Insert: None
insert new author, do nothing to related articles
This is fairly trivial, since new authors wouldn't have any articles yet (assuming other sensible RI rules are in effect).
Author Update: Cascade
update the author's primary key and also all matching article foreign keys
This is the crucial action we've been talking about that you want the database to do for you, and while you might not need to change a record's primary key often, when you need to do it, you want Cascade.
Author Delete: Restrict
delete author only if no articles have been written by this author
If you allowed the author delete to happen and left the articles in there, they would all have broken links; Restrict prevents this. Alternatively, if you specified the Author Delete rule as Set Null, the author would be gone and the articles would be left behind but you wouldn't know who wrote them; this is the only other choice that makes sense to me, but maybe not as much as Restrict. (And I'm sorry, I really cannot get into a discussion of nulls at this point.)
Article Insert: Restrict
insert new article only if the author already exists
This is the example discussed in detail earlier.
Article Update: Restrict
update article to a different author but only if the new author also already exists
Sometimes this is required to correct an input error.
Article Delete: None
delete article and do nothing to author
This makes sense, right? Note that if the article being deleted is the author's last or only one, this may result in an author with no articles -- but that's not the same thing as a broken link.
Options for Implementing RI
If you made it this far, congratulations -- things are about to get a whole lot simpler.
Standard SQL-92 syntax provides a mechanism for setting RI rules only for the parent delete and parent update actions. All others take a default. The rules are defined when declaring the foreign key on the child table:
CREATE TABLE child ( columndefinitions,
[constraintname] FOREIGN KEY (fkcolumns)
REFERENCES parent [(pkcolumns)]
CASCADE | SET NULL | SET DEFAULT | NO ACTION
CASCADE | SET NULL | SET DEFAULT | NO ACTION
As mentioned, all other rules take defaults, and if you do not define the parent delete and parent update rules, they take defaults too:
None of these defaults is very surprising, when you think about it. In fact, the only action for the author-article relation that we thought made sense to be different from the default is Author Update, which we decided we wanted as Cascade. Luckily, that is one of the rules that can be easily set by the SQL syntax.
So what's the big deal, then? If all the defaults are good, and we can change the ones we want, how will RI simplify your database project, shorten development time, and improve the quality of your life?
The answer comes when you consider how some applications go about maintaining relational integrity when RI is not turned on in the database.
Suppose the application wanted to insert an article, and also wanted to make sure the author already exists in the database. It is not difficult to imagine a developer coding up something like this --
do a SELECT on the author table using the author number for the proposed article insert as the search key
if the select fails, issue an error message such as "author does not exist; article insert rejected"
otherwise, proceed to INSERT the article
There is nothing inherently wrong with this approach, other than that
it requires an additional database call, which is often much less efficient, and
it takes extra time to do this programming
With RI turned on and Article Insert set to Restrict (the default), the developer simply codes
INSERT the article
test the database return code, and if it failed on an RI exception, issue the error message
It is not even always necessary to test the return code and issue a friendly error message. Assuming the foreign key constraint is named properly, the database error message will basically say something like "ERROR: integrity constraint ARTICLEAUTHOR violated - parent key not found" and this is often enough to indicate what's going on.
When designing a database, after all the entities, tables, columns and data types have been identified, you will have a number of one-to-many relations. Examine each relation in light of the six possible actions, and the five possible rules for each. Start by considering all the default rules. Pay special attention to the parent update and parent delete rules, because those can be changed easily, and often are. Then prepare a rule table for each relation, similar to the samples we described earlier for the author-article relation:
Highlight the rules that are different from the default, as above. Then pass these rule tables as specifications to your developer.
You'll save a lot of hassles, development time, and unnecessary coding. And you might just see an increase in the quality of your life.
I am indebted to ERwin, a popular data modelling tool, for first introducing me to the concept of RI actions outside the scope of what is normally defined by SQL syntax. You would not believe what you have to do, and what ERwin can generate in the way of triggers and stored procedures, to achieve non-default RI rules for actions other than parent update and parent delete. If your project requires these, give me a shout.
"Improving the quality of life" is a requirement that should be part of every database specification. I was taught this in my very first database design course for IDMS in 1977 by William Casey of Cullinane Corporation, and it's as true today as it was then.