Candidate Keys: Choosing the Right Primary Key
It is the business that knows its entities (things of importance) and what kind of unique information separates them (invoice numbers, customer IDs, and so on).
Your role as an analyst/designer is to validate such properties, and to fit them into a normalized database model that reflects the needs of the business.
During this work, you will discover many candidate keys and run into questions about how primary keys should be defined. With the wrong primary key decisions, chances are high that you will have a harder job normalizing the database model, and a harder job achieving a high-performing database in the end. (Read our separate eBook on database normalization.)
Three absolute demands on a candidate key, if it is to be regarded as a possible primary key
There are three fundamental demands that we must never deviate from if a candidate key is to become a primary key:
1. The candidate key must be unique within its domain (the entity it represents, and beyond, if you also intend to access external entities).
2. The candidate key cannot hold NULL values (NULL is not zero. Zero is a number. NULL is a 'nonexistent value').
3. The candidate key can never change. It must hold the same value for a given occurrence of an entity for the lifetime of that entity.
In 1., we say that the main purpose of the candidate key is to help us identify one single row in a table, regardless of whether the table holds billions of rows. This puts high demands on the key's ability to deliver uniqueness.
In 2., we say that a candidate key must always, without exception, hold a value. Without it, we break demand 1.
In 3., we say that no matter what happens, inside our business or outside of it, we are not allowed to change the value of the candidate key. Adding an attribute to (expanding) the candidate key would also constitute a change. Some say that you should be able to change the primary key, and do a 'cascade update' to all underlying tables.
We will deal with that in a later chapter.
If the candidate key can withstand all these three demands, we have most likely identified our primary key. As we proceed in this eBook, we will keep these three demands in our minds.
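As a minimal sketch of demands 1 and 2 in practice, the following uses Python's built-in sqlite3 module with a hypothetical invoices table (the table, its columns and the sample values are assumptions made for illustration, not taken from the eBook). NOT NULL is spelled out explicitly because SQLite, for historical reasons, does not imply it for non-integer primary keys. Demand 3 is not something a key declaration enforces by itself, so it is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: invoice_no is the declared primary key.
conn.execute("""
    CREATE TABLE invoices (
        invoice_no TEXT NOT NULL PRIMARY KEY,
        amount     REAL NOT NULL
    )
""")

def try_insert(conn, invoice_no, amount):
    """Insert one row; return 'ok' or the database's rejection message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoices (invoice_no, amount) VALUES (?, ?)",
                (invoice_no, amount),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_insert(conn, "INV-1001", 250.0),  # first use of the key value
    try_insert(conn, "INV-1001", 99.0),   # duplicate key: breaks demand 1
    try_insert(conn, None, 10.0),         # NULL key: breaks demand 2
]
for line in results:
    print(line)
```

Only the first insert should go through; the database refuses the other two before they can damage the table.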
You may also be interested in reading 'A Relational Model of Data for Large Shared Data Banks' by E.F. Codd himself (the founder of relational theory).
Notations in graphic examples.
Throughout this eBook, we will be using logical models (Entity Relationship diagrams) as well as physical models (what Oracle Designer calls Server diagrams). We will also exemplify the different constructions with SELECT statements where appropriate. First, a few words about notation in the logical model:
In this figure, a # in front of an attribute means the attribute is (part of) the primary key. Likewise, the horizontal bar over the crow's foot in the relationship between departments and accounts signals that the primary key of departments is also part of the primary key of the accounts entity. How smart that is, is a totally different issue (ACCOUNTS breaks Second Normal Form: NAME depends on ACCOUNT_NO alone, not on DEPT_ID).
A * in front of an attribute means that the attribute is mandatory (it must always have a value), while an o in front of an attribute means that giving it a value is optional. Likewise with the physical model:
Notice the difference between the two graphics: in the physical model in Fig. 3, the relationship from Fig. 2 is incorporated into the accounts table. The relationship in Fig. 3 now indicates an unconditional constraint between the two tables: you are not allowed to enter a value into DEPT_ID in accounts if it doesn't exist in the departments table. The horizontal line above the crow's foot does not necessarily indicate participation in the primary key of that table: only the # tells you whether a column is part of the primary key.
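The constraint between the two tables can be sketched with Python's sqlite3 module. Since the figures are not reproduced in this excerpt, the key layout below (ACCOUNT_NO alone as the primary key of accounts, DEPT_ID as a mandatory foreign key) and the sample rows are assumptions; only the column names DEPT_ID, ACCOUNT_NO and NAME come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE departments (
        dept_id INTEGER NOT NULL PRIMARY KEY,
        name    TEXT    NOT NULL
    );
    CREATE TABLE accounts (
        account_no INTEGER NOT NULL PRIMARY KEY,
        dept_id    INTEGER NOT NULL REFERENCES departments (dept_id),
        name       TEXT    NOT NULL
    );
    INSERT INTO departments (dept_id, name) VALUES (10, 'Sales');
""")

def add_account(conn, account_no, dept_id, name):
    """Return True if the row was accepted, False if a constraint refused it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO accounts (account_no, dept_id, name) VALUES (?, ?, ?)",
                (account_no, dept_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = add_account(conn, 1, 10, "Travel")  # department 10 exists
refused = add_account(conn, 2, 99, "Ghost")    # department 99 does not
print(accepted, refused)
```

The second insert is turned away because DEPT_ID 99 has no matching row in departments, which is the behaviour the relationship line in the physical model stands for.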
By the way, figure numbers do not start at Fig. 1, because this is just an excerpt from the eBook. You can check it out here: Primary and Foreign Keys.
We have also used the plural form for entity names as well as table names. The norm is the singular form for entities and the plural form for tables. We have made this exception for readability only.
Details on Uniqueness
A primary key is, as stated earlier in demand 1, a column, or a collection of columns, that together give us the ability to uniquely read one single row from a table, no matter how many rows exist in that table. In the logical world this is just theory, but in the physical world we get into a lot of trouble if we are unable to do that.
Consider a simple example:
You want to give a 10% pay raise to your top DBA, John Wilson.
You do that by writing:
UPDATE EMPLOYEES  -- table name assumed; the statement as given started at SET
SET SALARY = SALARY * 1.1
WHERE NAME = 'John Wilson';
If you have two employees with that name, they both get a pay raise. Not exactly what you intended...
Names are very unlikely candidates for primary keys.
You need something unique, and names are not unique. Never.
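The pay-raise accident can be reproduced in a few lines. This is a sketch using Python's sqlite3 module; the employees table, the emp_id column and the salary figures are hypothetical and only serve to compare an update by name with an update by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        emp_id INTEGER NOT NULL PRIMARY KEY,
        name   TEXT    NOT NULL,
        salary REAL    NOT NULL
    );
    INSERT INTO employees VALUES (1, 'John Wilson', 5000);  -- the top DBA
    INSERT INTO employees VALUES (2, 'John Wilson', 3000);  -- a namesake
""")

def raise_by_name(conn, name):
    """Apply the 10% raise by name; return how many rows were changed."""
    cur = conn.execute(
        "UPDATE employees SET salary = salary * 1.1 WHERE name = ?", (name,)
    )
    return cur.rowcount

def raise_by_id(conn, emp_id):
    """Apply the 10% raise by primary key; return how many rows were changed."""
    cur = conn.execute(
        "UPDATE employees SET salary = salary * 1.1 WHERE emp_id = ?", (emp_id,)
    )
    return cur.rowcount

rows_by_name = raise_by_name(conn, "John Wilson")
conn.rollback()                  # undo the accidental double raise
rows_by_id = raise_by_id(conn, 1)
conn.commit()
print(rows_by_name, rows_by_id)
```

The update by name touches both rows, while the update by key touches exactly the one row that was meant.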
Details on null values
Demand 2 is about not allowing null values in a primary key: how can you identify something you don't know? This point should go without saying, but I bring it up because, believe it or not, I have read articles claiming that there is such a thing as a weak primary key, meaning that null values are allowed. Absolute nonsense!
All meaning of the term primary key vanishes into thin air. Most serious database systems will detect such an attempt and refuse to let you go any further down that path, and with very good reason.
If you ever come across this situation, just refuse to go along with it. It is a path to failure. Actually, it is a nonsense discussion.
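Why a NULL cannot identify anything can be seen from how SQL compares it. The sketch below uses Python's sqlite3 module and a hypothetical customers table that deliberately has no key declared, so that rows with a NULL cust_no can be stored at all; the table and its contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        cust_no INTEGER,          -- deliberately no key and no NOT NULL
        name    TEXT NOT NULL
    );
    INSERT INTO customers VALUES (NULL, 'Acme');
    INSERT INTO customers VALUES (NULL, 'Globex');
""")

# Comparing NULL with NULL is neither true nor false; it yields NULL (None in Python).
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# An equality lookup on a NULL "key" therefore finds no rows at all.
found_by_equality = conn.execute(
    "SELECT name FROM customers WHERE cust_no = NULL"
).fetchall()

# IS NULL does find rows, but it cannot tell one of them from the other.
found_by_is_null = conn.execute(
    "SELECT name FROM customers WHERE cust_no IS NULL ORDER BY name"
).fetchall()

print(null_equals_null, found_by_equality, found_by_is_null)
```

Either the lookup returns nothing or it returns every row with a missing value, so a nullable column can never single out one row.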
Details on no change
Think about Africa. After the Second World War and the end of colonialism, how many countries have changed their names? A more recent example: there is no longer a country called the Union of Soviet Socialist Republics; its largest successor state is today called Russia.
We even have standards for naming countries: the former Soviet Union was known all over the world as USSR, and Russia is abbreviated as RUS today (or is it?). Look at these two fantastic tables:
Fig. 4: Country definitions from the United Nations
| Numeric code | Country name | Alpha-3 code |
|---|---|---|
| 410 | Republic of Korea | KOR |
| 498 | Republic of Moldova | MDA |
| 659 | Saint Kitts and Nevis | KNA |
| 666 | Saint Pierre and Miquelon | SPM |
| 670 | Saint Vincent and the Grenadines | VCT |
Fig. 5: Another table of country definitions from the ISO standardization organization
| Country | FIPS 10-4 | ISO 3166 alpha-2 | ISO 3166 alpha-3 | ISO 3166 numeric | Internet |
|---|---|---|---|---|---|
| Saint Kitts and Nevis | SC | KN | KNA | 659 | .kn |
| Saint Pierre and Miquelon | SB | PM | SPM | 666 | .pm |
| Saint Vincent and the Grenadines | VC | VC | VCT | 670 | .vc |
Isn't this great?
Western Samoa is abbreviated WSM by the United Nations, and WS by ISO! It even has three identifiers in ISO: WS, WSM and 882!
Take a look at USSR (sorry, RUS): the United Nations calls it RUS, while ISO uses RUS, RU and 643, and on the Internet it is .ru. But according to FIPS 10-4 it should be abbreviated RS! How can we identify a country by using standards from these helpful people?
If you do a search on Google for the term 'standard country codes' (without the quotes), you will find the sites for these two tables among the first ten hits. Do not go to the ISO page; they actually want you to walk around with a shopping cart and pay for it.
In Fig. 4, every column is a candidate key: no two countries share the same name or the same abbreviation. Yet the values differ between organizations once you take a broader perspective: in ISO, the name is Russia and the (most recently adopted) abbreviation is RU, but in the United Nations, the name of the same country is Russian Federation and the abbreviation is RUS. Furthermore, in FIPS 10-4, the abbreviation is RS. Do they never meet to talk with each other?
Viewed in isolation, all columns in Fig. 5 are also candidate keys. But not if you want to match entries between Fig. 4 and Fig. 5.
Actually, I can find only one common denominator here: Both ISO and UN give Russia a common number: 643. Here is something to investigate. Are all the other abbreviations and names meant only for each organization, while they share the 643 number for identification? Have they been talking to each other, after all?
It should be clear by now that we cannot rely on any of the character columns in either organization: while each of them is a candidate key in its own world, we would soon get into trouble if we tried to use them to talk to the rest of the world (that is, outside the scope of our little application).
As you can see, defining (or choosing) the right primary keys is not necessarily a simple, straightforward task.
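To make the matching problem concrete, here is a sketch that loads only the rows shown in Fig. 4 and Fig. 5 into two tables and counts how many countries can be paired up, depending on which columns are compared. It uses Python's sqlite3 module; the table names, the column names and the helper count_matches are assumptions introduced for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE un_countries (
        num_code INTEGER NOT NULL PRIMARY KEY,
        name     TEXT NOT NULL,
        alpha3   TEXT NOT NULL
    );
    CREATE TABLE iso_countries (
        name     TEXT NOT NULL,
        fips     TEXT NOT NULL,
        alpha2   TEXT NOT NULL,
        alpha3   TEXT NOT NULL,
        num_code INTEGER NOT NULL PRIMARY KEY,
        tld      TEXT NOT NULL
    );
    INSERT INTO un_countries VALUES
        (410, 'Republic of Korea', 'KOR'),
        (498, 'Republic of Moldova', 'MDA'),
        (659, 'Saint Kitts and Nevis', 'KNA'),
        (666, 'Saint Pierre and Miquelon', 'SPM'),
        (670, 'Saint Vincent and the Grenadines', 'VCT');
    INSERT INTO iso_countries VALUES
        ('Saint Kitts and Nevis', 'SC', 'KN', 'KNA', 659, '.kn'),
        ('Saint Pierre and Miquelon', 'SB', 'PM', 'SPM', 666, '.pm'),
        ('Saint Vincent and the Grenadines', 'VC', 'VC', 'VCT', 670, '.vc');
""")

UN_COLUMNS = {"num_code", "name", "alpha3"}
ISO_COLUMNS = {"name", "fips", "alpha2", "alpha3", "num_code", "tld"}

def count_matches(conn, un_column, iso_column):
    """Count the row pairs that join on the two given columns."""
    # Column names are checked against fixed sets before being put into the SQL text.
    if un_column not in UN_COLUMNS or iso_column not in ISO_COLUMNS:
        raise ValueError("unknown column")
    sql = (
        "SELECT COUNT(*) FROM un_countries u "
        f"JOIN iso_countries i ON u.{un_column} = i.{iso_column}"
    )
    return conn.execute(sql).fetchone()[0]

by_number = count_matches(conn, "num_code", "num_code")  # shared numeric code
by_fips = count_matches(conn, "alpha3", "fips")          # UN letters vs FIPS 10-4
by_alpha2 = count_matches(conn, "alpha3", "alpha2")      # UN letters vs ISO alpha-2
print(by_number, by_fips, by_alpha2)
```

For these few rows the numeric code pairs up all three countries present in both tables, while comparing the UN letter code with the FIPS or two-letter ISO code pairs up none. The three-letter codes happen to agree for these rows too, but the Western Samoa and Russia examples in the text show why a character code chosen from one organization cannot be trusted to line up with another organization's code.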
How about another case:
You are a manufacturer, and one of your customers places an order on behalf of his customer, asking you to use his customer's customer number as identification: How long will it take before another of your customers does the same, and by sheer coincidence, they supply you with the same id (let us say cust_no 1000), but they are totally different companies?
You will have to organize your system in such a way that you can uniquely identify all information going into the system, regardless of external systems, and regardless of external changes. That's the bottom line: You have to make the primary key foolproof.
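One possible way to organize this, sketched with Python's sqlite3 module, is to let the system assign its own customer key and to store an externally supplied number only together with the customer who supplied it. The table names, column names, company names and the two helper functions are all hypothetical; this is one design under those assumptions, not the eBook's prescribed solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- assigned by our own system
        name        TEXT NOT NULL
    );
    CREATE TABLE external_ids (
        supplied_by      INTEGER NOT NULL REFERENCES customers (customer_id),
        external_cust_no TEXT    NOT NULL,
        customer_id      INTEGER NOT NULL REFERENCES customers (customer_id),
        PRIMARY KEY (supplied_by, external_cust_no)
    );
""")

def add_customer(conn, name):
    """Create a customer and return the key our system generated for it."""
    with conn:
        return conn.execute(
            "INSERT INTO customers (name) VALUES (?)", (name,)
        ).lastrowid

def add_end_customer(conn, supplied_by, external_cust_no, name):
    """Register a customer's customer under the number that customer uses."""
    customer_id = add_customer(conn, name)
    with conn:
        conn.execute(
            "INSERT INTO external_ids VALUES (?, ?, ?)",
            (supplied_by, external_cust_no, customer_id),
        )
    return customer_id

def resolve(conn, supplied_by, external_cust_no):
    """Translate (who supplied it, their number) into our own key, or None."""
    row = conn.execute(
        "SELECT customer_id FROM external_ids "
        "WHERE supplied_by = ? AND external_cust_no = ?",
        (supplied_by, external_cust_no),
    ).fetchone()
    return row[0] if row else None

dealer_a = add_customer(conn, "Dealer A")
dealer_b = add_customer(conn, "Dealer B")
# Both dealers happen to use cust_no 1000, for totally different companies.
first = add_end_customer(conn, dealer_a, "1000", "Northwind Ltd")
second = add_end_customer(conn, dealer_b, "1000", "Southwind Inc")
print(first, second, resolve(conn, dealer_a, "1000"), resolve(conn, dealer_b, "1000"))
```

The coinciding cust_no 1000 causes no clash here, because the external number is never used as the identifier on its own: each company gets its own internally generated key, and the external number is only meaningful in combination with the customer that supplied it.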