Normalize a database

The goal of normalization is to remove the following dangerous features from the database:

  • Duplication of data
  • Inconsistent data
  • Ambiguous data

Normalization starts off as a mechanical process where you just follow the rules, but applied blindly those rules can lead to a ridiculous end result. The final design needs to be adjusted by an intelligent hand.

Stages of normalization

There are six stages of normalization, but the first three - known as the 1st, 2nd and 3rd Normal Forms - are adequate for most situations. The other stages - Boyce-Codd, 4th and 5th - are more relevant to academic study, and even then it can be difficult to construct examples that show the need for them.

1st Normal Form (1NF)

  • There is a Primary Key that uniquely identifies each record.
  • There are no repeating fields - you don't have a series of similar fields named 'Item1', 'Item2', 'Item3', etc.
  • Remove any such fields into a separate table and include a foreign key that refers back to the matching record in the parent table, as in the sketch below.
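
As a sketch, with hypothetical table and field names, the repeating 'Item1', 'Item2', 'Item3' fields might be removed like this:

  -- Before 1NF (commented out): repeating fields in the parent table.
  -- CREATE TABLE Orders (
  --     OrderID  INTEGER PRIMARY KEY,
  --     Item1    VARCHAR(50),
  --     Item2    VARCHAR(50),
  --     Item3    VARCHAR(50)
  -- );

  -- After 1NF: each item becomes a row in a child table with a
  -- foreign key back to the matching record in the parent table.
  CREATE TABLE Orders (
      OrderID    INTEGER PRIMARY KEY,
      CustomerID INTEGER,
      OrderDate  DATE
  );

  CREATE TABLE OrderItems (
      OrderItemID INTEGER PRIMARY KEY,
      OrderID     INTEGER NOT NULL REFERENCES Orders (OrderID),
      Item        VARCHAR(50)
  );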

2nd Normal Form (2NF)

  • The table is in 1st Normal Form and all fields depend on the entire primary key. This happens automatically if you are using a simple primary key based on a single field.
  • If you are using a composite key made up of two or more fields then you have to check that no other field relies on just one of them.
  • Remove any such fields into a separate table, together with that part of the primary key on which they depend, as in the sketch below.
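
For example (hypothetical names), in an order-line table keyed on OrderID and ProductCode, a ProductDescription field depends only on ProductCode, so it moves out with that part of the key:

  -- The partial dependency moves into its own table, keyed on the
  -- part of the composite key it depends on.
  CREATE TABLE Products (
      ProductCode        VARCHAR(20) PRIMARY KEY,
      ProductDescription VARCHAR(100)
  );

  -- The order-line table keeps the composite key and the fields
  -- that depend on the whole of it.
  CREATE TABLE OrderLines (
      OrderID     INTEGER,
      ProductCode VARCHAR(20) REFERENCES Products (ProductCode),
      Quantity    INTEGER,
      PRIMARY KEY (OrderID, ProductCode)
  );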

3rd Normal Form (3NF)

  • The table is in 2nd Normal Form and no fields depend on anything but the primary key.
  • Remove any such fields to another table, leaving the key in the original table, as in the sketch below.
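
For example (hypothetical names), if an Employees table keyed on EmployeeID also held DepartmentName, that field depends on DepartmentID rather than on the primary key, so it moves out and the key stays behind:

  -- The transitively dependent field moves to its own table.
  CREATE TABLE Departments (
      DepartmentID   INTEGER PRIMARY KEY,
      DepartmentName VARCHAR(50)
  );

  -- The original table keeps only the key that points at it.
  CREATE TABLE Employees (
      EmployeeID   INTEGER PRIMARY KEY,
      EmployeeName VARCHAR(50),
      DepartmentID INTEGER REFERENCES Departments (DepartmentID)
  );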

Result of normalization

The process starts with all your information in a single flat table and normalization gives you a number of smaller tables that hold the same information but which hold it in an efficient and unambiguous way. The process so far has been mechanical - in fact Microsoft Access has a Wizard to do it for you automatically - but there are three factors which need some intelligence to avoid unnecessary complications.

One-to-one relations

Normalization will never generate two tables with a one-to-one relationship between them. There is no theoretical reason to split a single entity like this, with some fields in a record of one table and the rest in a matching record of another, but you might want to split a table into two for practical reasons - see the sketch after this list - in order to:

  • reduce the number of fields in a table and meet some limit in the programming language.
  • store large and rarely-used fields in a separate table so that you do not waste time transferring and processing them during routine operations.
  • separate sensitive information from commonly-used fields so that the tables can be stored and backed-up separately.
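
A sketch of such a split, with hypothetical names: the two tables share the same primary key, so each customer has at most one row in the second table.

  CREATE TABLE Customers (
      CustomerID INTEGER PRIMARY KEY,
      Name       VARCHAR(50),
      Town       VARCHAR(50)
  );

  -- Large, rarely used or sensitive fields kept in a separate table
  -- that can be stored and backed up on its own.
  CREATE TABLE CustomerDetails (
      CustomerID       INTEGER PRIMARY KEY REFERENCES Customers (CustomerID),
      Photograph       BLOB,          -- exact large-object type varies by engine
      CreditCardNumber VARCHAR(25)
  );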

Many-to-many relations

A many-to-many relationship cannot be implemented directly in a relational database and has to be represented by an intermediate table that records the many links between the two tables. Sometimes this intermediate table will hold just the two foreign keys - for example the Customer ID and the Product Code, where many customers each order many products and each product is ordered by many customers. Sometimes more information will be held, for example the discount offered on this product when it is ordered by this customer.

Normalization will tell you that Customer and Product information should be separated from the original flat Orders table because every customer places many orders and every product is included in many orders. Normalization will not tell you how best to implement the link between these tables.
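One common implementation is an intermediate (junction) table holding the two foreign keys, plus any fields - such as the discount - that belong to the pairing itself. The names here are hypothetical:

  -- Junction table: one row per customer/product pairing.
  CREATE TABLE CustomerProducts (
      CustomerID  INTEGER     REFERENCES Customers (CustomerID),
      ProductCode VARCHAR(20) REFERENCES Products (ProductCode),
      Discount    DECIMAL(5,2),
      PRIMARY KEY (CustomerID, ProductCode)
  );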

Denormalization

Normalized tables store data as efficiently and as safely as possible, without duplication, but you may need to denormalize - see the sketch after this list - in order to:

  • make reporting easier for non-technical staff
  • create a Data Warehouse
  • improve the speed of execution
  • reduce the length of an SQL statement and comply with language restrictions.
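
As a sketch, with hypothetical names, one common form of denormalization is a reporting table built by running the join once and accepting the duplicated data in exchange for simpler, faster reads. The CREATE TABLE ... AS syntax varies by engine (SQL Server, for instance, uses SELECT ... INTO):

  -- One row per order line, with customer and product details copied
  -- in so that reports need no joins.
  CREATE TABLE OrderReport AS
  SELECT o.OrderID,
         o.OrderDate,
         c.Name AS CustomerName,
         p.ProductDescription,
         ol.Quantity
  FROM   OrderLines ol
  JOIN   Orders     o ON o.OrderID     = ol.OrderID
  JOIN   Customers  c ON c.CustomerID  = o.CustomerID
  JOIN   Products   p ON p.ProductCode = ol.ProductCode;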

Remember that the theory of normalization was developed at a time when a megabyte of disk storage space cost over £100. Prices now are less than a tenth of a penny a megabyte and there is no longer the pressure for such extreme efficiency. Processor time has also become very much cheaper and developer time is now much more expensive than either.

As an extreme example, you could normalize an address table and realise that the same county name occurred many times. According to the rules you would separate the county names into their own table with a CountyID key in the address table. But you would then realise that the town names also repeat and they need their own table too. As do the street names and the postcodes and the telephone numbers. The rules have no discretion and no common sense. If a single postcode occurs twice in the main table then it has been repeated and so postcodes must be stored in a separate table.

The normalized database now consists of county, town, district, street, house name, house number, postcode and telephone tables. This might save disk space but you would need an awkward and inefficient query to join these eight tables every time that you wanted to generate a customer address. This normalization is correct in theory but totally wrong in practice.
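
To make the problem concrete, even a trimmed-down version of the query to rebuild a single address from the fully normalized design might look something like this (all names hypothetical):

  SELECT h.HouseNumber,
         h.HouseName,
         s.StreetName,
         d.DistrictName,
         t.TownName,
         co.CountyName,
         pc.Postcode
  FROM   Addresses  a
  JOIN   Houses     h  ON h.HouseID     = a.HouseID
  JOIN   Streets    s  ON s.StreetID    = a.StreetID
  JOIN   Districts  d  ON d.DistrictID  = a.DistrictID
  JOIN   Towns      t  ON t.TownID      = a.TownID
  JOIN   Counties   co ON co.CountyID   = a.CountyID
  JOIN   Postcodes  pc ON pc.PostcodeID = a.PostcodeID
  WHERE  a.CustomerID = 123;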