http://r937.com/20020620.html
Online Extra
Mike Lonigro
The Case for the Surrogate Key
A simple key to the flexible enterprise database
There is a simple, inexpensive, uniform element of database design that you are likely avoiding in favor of a complex, costly, and inconsistent one. This element is the surrogate or substitute primary key. It seems that designers avoid these independent keys like the plague. Instead, all but the most basic business entities are given keys made up of some series of attributes, borrowed keys, and sequence numbers . It was only after experiencing many problems at the hand of intelligent keys that I, too, became a believer in the flexibility and stability afforded by the substitute key.
In this article I'll illustrate why the surrogate primary key is a critically important ingredient in designing stable, flexible, and well-performing enterprise databases. I will define the issues, analyze them to arrive at a set of principles of primary key design, identify the criteria for defining a surrogate key on a table, suggest a standard for implementation, and finally, note some benefits our shop (J. B. Hunt Transport Inc.) enjoys from consistent application of a surrogate primary key.
THE ISSUES
The first issue is the function of the primary key. What exactly is the real purpose of the primary key? Because form follows function, defining the role of the primary key will settle much debate about its format, and it will help us determine some principles with which we can resolve the remaining issues.
Another issue is the distinction between the logical and physical keys. Can these keys be the same? For example, a customer identifier and date may be used to check for a duplicate order before entering it into a database. A sequence number may be added to distinguish line items one from another. But once the order and line items are entered into the database, should these data elements form the primary keys for these tables?
The meaningfulness of the primary key is another issue I'll discuss. How much information should the primary key hold? For example, a legacy system in our shop gave to each transport accident a primary key formed by combining the year with an accident sequence number and a type code. Another application assigned order numbers within ranges based on order type. Are these good choices for the primary key?
The final issue I will address is the source of the primary key. Can it be provided from outside the system or by a system user? Examples are Social Security numbers to identify people; user-generated pnemonic codes to identify customers and accounts; and catalog, part, or transaction numbers generated by external agencies. Why not use these identifiers as primary keys?
ANALYSIS
Proper primary key design recognizes that there are two unique keys to be considered. Database designers make a critical mistake when they fail to recognize the distinction between these keys. The logical unique key distinguishes entity occurrences from others to avoid duplicates before entry into the database. That is why this key is typically composed of attributes and relationships that the business can recognize as distinguishing occurrences logically from others. The primary key, however, is a physical design element that uniquely identifies rows after they are in the database. It is used in creating foreign keys in other tables in order to establish the logical relationships in the data model. In fact, it is best to consider a primary key's primary purpose to be establishing relationships in the database!
The primary key's role as a physical row identifier has important implications. First, it requires that the primary key be physically unique in the primary key table. Tables are sometimes created with primary keys that logically identify occurrences but may appear more than once in the primary key table. This repetition usually betrays a failure to define properly either the criteria for new occurrences of the entity or the real physical purpose of the table.
For example, a table may store work order types and identify them with a number or code. But the identifier may appear on more than one "inactive" row in the primary key table with perhaps one occurrence in an "active" status. In this case, the meaning of the work order type becomes unclear. Are all rows with the same key related to the original occurrence? Or are they truly different work order types that have passed their identifiers to newer occurrences? Alternatively, several rows sharing a primary key value may be distinguished by effective and end date ranges. In this case, the table is really acting as an audit of changes to the logical occurrence. But what table really represents the business entity itself?
Second, the primary key must never be null. This must be true even if any of the elements of the logical unique key are unknown. Like nonunique primary keys, null primary keys usually signal a failure to define clearly what the row represents. If we create an ORDER but fail to give it an order number until it is shipped, does the row represent the ORDER or the SHIPMENT? If we create an EMPLOYEE APPLICANT but do not assign a key until the application is received, are we equating the APPLICANT with the APPLICATION? What happens if we get a second APPLICATION from the same APPLICANT?
Third, the primary key must be absolutely stable. It must be protected in all cases from change. This does not mean protected from likely change; rather, it must be guaranteed not to be allowed to change. This third principle is always violated when an "intelligent" key is used as the primary key.
In the first example, the relationships and attributes that logically identify the order and line item are also used as the primary key (see Figure 1). One problem with these is that they can change. The order's customer may be incorrectly entered, for example. If we discover the error midway through processing the order, how do we correct it? There are two unpleasant options: We can either create a brand new order and cancel the original, or we can change the primary key value of the order. Suppose we choose the first option and create another order. Is this a true picture of business reality? That is, do we really have another order? Do we really want to reprocess the order under a new identifier?
1A If any attribute making up the primary key of the ORDER or ORDER LINE ITEM changes, all foreign key references to these tables (TABLE A, TABLE B, and TABLE C) must also be updated and committed together.
1B Here the ORDER and ORDER LINE ITEM tables have surrogate primary keys different than the unique keys. All foreign key references to them (TABLE A, TABLE B, and TABLE C) are insulated from changes to unique key attributes.
Figure 1. Eliminating the Effect of Change in Key Value.
The second option is to change the value of the order's primary key. In this case, we must also change every foreign key reference to the order within the same commit unit! Each new reference must also be included in this update. Would it not be more correct and efficient simply to change the order's customer and continue processing the order? A surrogate key independent of all the order's attributes or relationships allows us to correct the order and maintain the information already contained in the database easily.
Another problem illustrated here is that of concatenated primary keys. Because the order's primary key in our example has been migrated into the primary key of the line item, all references to the order line item must also change with a change in the order's key. The problem grows as we chain together more keys in the database. Migration of primary keys in this fashion has terrible implications for database flexibility.
These examples illustrate how primary keys must be insulated from changes in value and so should not comprise attributes, relationships, or primary keys migrated from other tables. But even if its value is truly stable, the primary key must never be the same as the logical unique key because the set of attributes and relationships that make an entity unique may itself change.
Associative entities provide a good example of this (see Figure 2). An associative entity contains attributes about a many-to-many relationship between two or more occurrences. Because the entity is defined by its relationships, we are tempted to use the foreign key columns in the primary key. For example, our company has a PART OFFERING associative entity, which stores pricing and configuration information for a PART as it is sold or offered by a VENDOR. A PART OFFERING is unique on the PART id, VENDOR id, and offering date. Why not use these columns as the primary key?
2A Here the PART OFFERING associative entity uses the logical unique key as the primary key. A change requiring PART OFFERINGs to be unique additionally on TIME requires changes to all related foreign key structures (TABLE A and TABLE B).
2B The PART OFFERING has a surrogate key different from the logical unique key. The new requirement to make PART OFFERING unique on an additional TIME attribute does not impact any foreign key structures (TABLE A and TABLE B).
Figure 2. Eliminating the Effect of Change in Key Structure.
The reason is that over time, the business may wish to capture PART OFFERINGs at a finer level of detail. We may begin to track multiple PART OFFERINGs for fuel offered by the same vendor at more than one price in a day. If we had used the unique key as the primary key, we would need to modify all programs and tables that reference unique PART OFFERINGs to include the offering time. A more stable design would give the PART OFFERING a surrogate key. The new business requirement would require only changes to the unique key index on the PART OFFERING table and any programs that insert or update its rows. The programs and other tables referencing the PART OFFERING by the surrogate key would be unaffected.
The third issue I raised involves primary keys that contain attributes unrelated to the logical unique key. This type of key appears in many forms and is subject to the same problems I addressed above. Additionally, these keys are especially inflexible to expansion of the definition or usage of data. New subtypes of orders or accidents, for example, are typically handled by assigning different ranges or formats of primary key values. As these applications increase, smaller key ranges cause rollover and archiving issues while applications are forced to come up with imaginative new identification schemes just to insert a row in the database.
Another example of this kind of key occurs when a key value implicitly represents a hierarchy among rows in a table. This type of key is inflexible to additional levels and limits the number of occurrences at any given level. For example, standard commodity codes used across the transportation industry are assigned key values whose format indicates an arbitrary level in a hierarchy of commodity classifications. A broad commodity classification is given a key of 100000; the classification level immediately beneath it has keys of the form 101000, 102000, and so on; and the next level keys of the form 101010, 101020. A better design would give each occurrence a surrogate primary key; child occurrences would have a foreign key back to the parent. In this way, classifications can have a varying and unlimited number of levels and occurrences.
Finally, primary key values must be internally assigned. The first reason is, again, due to stability. Social Security numbers may be miskeyed, unknown, or changed; business users may invent new coding schemes to identify customers; and outside agencies may reformat or reuse identifiers at their own will.
Another reason the values must be assigned internally is that we may define entities differently than the outside agencies that provide our data. For example, a vendor that sells us uniforms uses a different ITEM number to identify each different size of uniform. In our database, however, we create one ITEM with many sizes. This standard definition allows us to draw consistent information out of the database independent of vendor numbering schemes. Using the vendor's assigned ITEM number would cause us to create ITEMs based on each vendor's definition and not ours.
The best way to avoid the pitfalls indicated by these illustrations is to choose a primary key that is logically meaningless; this is the surrogate primary key. Doing this means, above all, that the primary key must be owned by the data administrator. Intelligent, problematic primary keys will be assigned when application developers or users begin to take ownership of the primary key. When this happens, the real purpose of the primary key is lost; it becomes a "good idea" number that tells a story about the row it identifies. The primary key is not a "good idea" number; its purpose is to physically identify a row to the database.
To summarize, the primary key must be:
Unique
Not null
Stable
Different than the logical unique key
Internally assigned
Logically meaningless.
THE CRITERIA
Does every table require a surrogate primary key? The answer is no. The prerequisites derive from the definition of the primary key. Because the primary key identifies a row after it is in the database, it follows that a table requires a primary key if its rows will be referenced individually by other tables after entry into the database.
There are two alternative views of this. The logical view asks: does the table participate in more than one relationship from the logical data model? For example, in a human resources database, an EMPLOYEE table will likely participate in more than one relationship; we may relate the EMPLOYEE to BENEFITs, POSITIONs, LOCATIONs, EVALUATIONs, and many other entities in our data model. For this reason, the EMPLOYEE gets a surrogate key (see Figure 3). But now suppose we need a table to store the EMPLOYEE DEPENDENTs. We must determine whether or not the DEPENDENT participates in more than one relationship from the data model. If we care to relate the DEPENDENT only to the EMPLOYEE, then the DEPENDENT table represents a repeating attribute of the EMPLOYEE entity. It does not require a surrogate key, but it may have a unique key consisting of the EMPLOYEE key and some combination of attributes of the DEPENDENT. If, however, we also wish to relate DEPENDENTs to BENEFITs, then the DEPENDENT participates in a second relationship, becomes an entity in the logical data model, and gets a surrogate key.
3A The DEPENDENT is related only to the EMPLOYEE, and has a unique key but no surrogate key.
3B The DEPENDENT has its own relationships (TABLE A and TABLE B) and gets a surrogate key.
Figure 3. The Data Model in Primary Key Assignment.
The physical view asks: Do we need to reference the rows individually for physical processing reasons? For example, we may wish to trigger certain DEPENDENTs to be processed through some application. Rather than reading all rows in the table looking for a few process flags, we may give the DEPENDENT a surrogate key and store the values of the "flagged" DEPENDENTs in a trigger table. After the DEPENDENT is processed, the row representing the DEPENDENT in the trigger table is deleted. This may be simpler than storing the series of columns that reference a single DEPENDENT.
A SUGGESTED STANDARD
Our shop has enjoyed success with the implementation of a simple primary key standard containing two critical elements. First, all our primary keys are sequentially assigned positive integers. In DB2, we use a large INTEGER column. This allows us to store more than 2.1 billion occurrences of any entity in our model. We can create an occurrence every second of every day for 68 years and still not run out of values. Certainly few business entities are created this frequently.
Second, all primary keys are assigned through an enterprise module that receives a logical table name, obtains an exclusive lock on a row containing that table's next primary key value, increments the value, and passes it back to the caller. The primary key is assigned by logical table because we sometimes split a logical table into two or more physical tables for performance. The rows that form the one logical row share their primary key value.
THE BENEFITS
The consistent format of primary keys throughout our database provides benefits in addition to flexibility. One is in the implementation of optional relationships (see Figure 4). For example, we create INVOICE ITEMs but may not relate them to an INVOICE until perhaps several days later. The INVOICE foreign key in the INVOICE ITEM table thus needs to be logically null. Physically, however, a zero or dummy INVOICE value would heavily skew the foreign key index. Instead, the null foreign key receives a value equal to the row's own primary key multiplied by negative one. This distributes the foreign key index values evenly across all rows, whether or not the foreign key relationship exists.
4 The INVOICE ITEM has an optional relationship to the INVOICE. ITEMs without a related INVOICE receive a foreign key value equal to their primary key multiplied by -1. The index over the INVOICE foreign key is evenly distributed.
Figure 4. Handling Absent Foreign Keys.
A second benefit is that "Exclusive OR" relationships can be handled with one well-indexed column (see Figure 5). An "Exclusive OR" relationship exists when an entity occurrence may relate to at most one of a number of possible occurrences. A given INVOICE ITEM, for example, may pertain to one of any number of different possible entities in our model; we may invoice SHIPMENTs, STOPs, ACTIVITIES, or others. Different primary key domains would require a foreign key for each possible relationship, all but one of which would contain a value. Instead, we use a single indexed foreign key column. A second column names the table referenced by the foreign key.
5 The INVOICE ITEM can bill a SHIPMENT, STOP, or ACTIVITY, exclusively. A single foreign key handles all relationships. A second column names the table containing the primary key value.
Figure 5. Handlign Exclusive or Foreign Keys.
Poor primary key design is perhaps the most repeated and damaging flaw in database design. We would do well to remember that a smart key is a dumb thing to do, while a dumb key is a smart thing to do.
Mike Lonigro is the senior data administrator responsible for all conceptual, logical, and physical database design at J. B. Hunt Transport Inc., the nation's largest publicly held truckload carrier. J.B. Hunt Transport runs on DB2 for OS/390 at more than 3 million transactions per day. Previously, Lonigro was a consultant with Price Waterhouse MCS in the Petroleum Special Practice Unit. You can reach him via email at lonigro@mail.jbhunt.com.
search - home - archives - contacts - site index
Copyright 1997 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.
Questions? Comments? We wo