Natural or Surrogate Key
Autonumbers, technical keys, intelligent keys too
Great discussion on WebDesign-L today about database tables, including one-to-many and many-to-many designs, and the use of keys. The opportunity to talk about country code came up -- or perhaps it was just my imagination. I mentioned it anyway, because country code is a great example of a natural key. In the wild, country code is almost always a foreign key, referencing the usually non-existent country table.
A recent article of mine, Using generated numbers for primary keys, part 2 also mentions country code. I favour surrogate keys in most cases, but I love country code as an example of a natural key. Being of good open mind (a requisite skill for a data modeller), I also included links to other articles:
Since then, I've found additional articles on natural and surrogate keys. SQLteam.com has Identity and Primary Keys - Part I, which includes "some very good reasons NOT to use an Identity column." Understanding Identity Columns is a practical how-to for SQL/Server identity columns.
- The primary key must uniquely identify each record.
- A record's primary-key value can't be null.
- The primary key-value must exist when the record is created.
- The primary key must remain stable -- you can't change the primary-key field(s).
- The primary key must be compact and contain the fewest possible attributes.
- The primary-key value can't be changed.
There are six points there, right? First, unique. Check. Second, not null. Check. But the third point is the same as the second -- the primary key cannot be null. The fourth and sixth points are sort of the same One of them deals with redesign, which is not really germane, and the other with changing values. As for compactness, "fewest possible" is either an unclear reference to normalization, or a general design principle nobody would argue with. So really, there are only three rules and a suggestion. And one of the rules is wrong! Primary keys do change. Both their design, when that's required, and their values. It happens all the time. But the article says "Note that the word must in the above list doesnâ€™t mean perhaps or most of the time -- must is absolute."
There's a nice chart on page 2 of the article, but it perpetuates the kind of thinking just shown. At the end, the authors say "To avoid future problems and subsequent (and perhaps convoluted) repairs, we recommend that you use surrogate keys." I happen to agree with this recommendation, but not with the way they arrived at it. And apparently they've never migrated surrogate keys from one database to another, because that's painful as well as convoluted. There are several dozen lively comments posted by readers at the end of the article, ranging from "totally agree" to "flat wrong."
Two other articles on the same site also disappointed me recently. Intelligent keys aren't that dumb presents a good argument for not designing a key with "built-in" meaning, thus seeming to contradict the title. For example, don't use a 1 as part of the order number to mean the first order for a customer that day. No trained data modeller would do this, though. The article concludes that when you "finally discard their precious intelligent key, [the users] will recognize it as an intelligent decision." Well, sure. But in reaching this conclusion, the author has trampled all over the concept of a natural key. "The main key -- the primary key -- should never contain data entered by users." The article never actually uses the term surrogate key, but it's there. "If the users need an invoice number to print on a piece of paper for a warehouse worker, then give it to them. However, the database ought to handle all the hidden data according to the rules of normalization." There's good advice in the article, especially about dealing with users during a database design project, but the argument for surrogate keys is weak.
Finally, I recently read the Builder.com article Database design for platform independence. It starts out with great intentions, but immediately makes two suggestions that I would call "howlers" -- don't use stored procedures and don't use joins. I can understand where the author is coming from, but that's ridiculous, and as one person stated in the comments posted to the article, it shows "how a corporate dictum like 'All database applications must be independent of vendor specific features' is not only impractical but dangerous. What pointy headed manager came up with this?"
Even more hilarious were the descriptions of the types of joins (as if that mattered, since you shouldn't be using them) --
Joins are powerful SQL commands for creating tables from data that is retrieved from multiple sources.
Equi-Joins are formed by retrieving all the data from two separate sources and combining it into one, large table.
Inner Joins are joined on the inner columns of two tables. Outer Joins are joined on the outer columns of two tables.
Left Joins are joined on the left columns of two tables. Right Joins are joined on the right columns of two tables.
There are also other kinds of joins -- left/inner, left/outer, right/inner, and right/outer.
In the comments posted after the article, the author gets duly reamed for those definitions. It's sad, really.
There was a nice attempt to identify "safe" datatypes. Anyone who has migrated an autonumber or identity field to an Oracle sequence number will know the effect that this can have on SQL. But would you avoid using a generated number just to make the application "cross-database"? That'd be silly. Even the author says some datatypes, like Microsoft Access’ AUTONUMBER, "are quite handy."
If you have any comments or questions about natural or surrogate keys, please contact me.