Normalize a database
The goal of normalization is to remove the following dangerous features
from the database:
- Duplication of data
- Inconsistent data
- Ambiguous data
Normalization starts off as a mechanical process in which you just follow
the rules, but following the rules blindly can lead to a ridiculous end
result. The final design does need to be adjusted by an intelligent hand.
Stages of normalization
There are six stages to normalization, but the first three - known as
the 1st, 2nd, and 3rd Normal Forms - are adequate for most situations.
The other stages - Boyce-Codd, 4th, and 5th - are more relevant to
academic study, and even then it can be difficult to construct examples
that show the need for them.
1st Normal Form (1NF)
- There is a Primary Key that uniquely identifies each record.
- There are no repeating fields - you don't have a series of similar fields named 'Item1', 'Item2', 'Item3', etc.
- Remove any such fields into a separate table and include a foreign key that refers back to the matching record in the parent table, as in the sketch below.
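As a sketch of how the repeating fields might be removed in practice (the table and column names here are invented purely for illustration):

    -- Before 1NF: repeating fields in a single record
    --   Orders(OrderID, CustomerName, Item1, Item2, Item3)

    -- After 1NF: the repeating group becomes rows in a child table
    CREATE TABLE Orders (
        OrderID      INTEGER PRIMARY KEY,
        CustomerName VARCHAR(50)
    );

    CREATE TABLE OrderItems (
        OrderID  INTEGER REFERENCES Orders(OrderID), -- foreign key back to the parent record
        ItemNo   INTEGER,                            -- replaces the Item1, Item2, Item3 suffixes
        ItemName VARCHAR(50),
        PRIMARY KEY (OrderID, ItemNo)
    );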
2nd Normal Form (2NF)
- The table is in 1st Normal Form and all fields depend on the entire primary key. This happens automatically if you are using a simple primary key based on a single field.
- If you are using a composite key made up of two or more fields then you have to check that no other field relies on just one of them.
- Remove any such fields into a separate table, together with that part of the primary key on which they depend, as in the sketch below.
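A sketch of the 2NF check, again using invented names: OrderItems has a composite key of OrderID and ProductCode, but ProductName depends on ProductCode alone, so it moves out together with that part of the key:

    -- Violates 2NF: ProductName depends on only part of the key (OrderID, ProductCode)
    --   OrderItems(OrderID, ProductCode, ProductName, Quantity)

    -- After 2NF: the partially-dependent field moves to its own table
    CREATE TABLE Products (
        ProductCode VARCHAR(10) PRIMARY KEY,
        ProductName VARCHAR(50)
    );

    CREATE TABLE OrderItems (
        OrderID     INTEGER,
        ProductCode VARCHAR(10) REFERENCES Products(ProductCode),
        Quantity    INTEGER,
        PRIMARY KEY (OrderID, ProductCode)
    );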
3rd Normal Form (3NF)
- The table is in 2nd Normal Form and no fields depend on anything but the primary key.
- Remove any such fields to another table, leaving the key in the original table, as in the sketch below.
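A sketch of the 3NF rule with invented names: DeptName depends on DeptID rather than on the primary key EmployeeID, so it moves out and only the key DeptID stays behind:

    -- Violates 3NF: DeptName depends on DeptID, not on EmployeeID
    --   Employees(EmployeeID, Name, DeptID, DeptName)

    -- After 3NF: the dependent field moves out, the key stays in the original table
    CREATE TABLE Departments (
        DeptID   INTEGER PRIMARY KEY,
        DeptName VARCHAR(50)
    );

    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        Name       VARCHAR(50),
        DeptID     INTEGER REFERENCES Departments(DeptID)
    );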
Result of normalization
The process starts with all your information in a single flat table and
normalization gives you a number of smaller tables that hold the same
information but which hold it in an efficient and unambiguous way. The
process so far has been mechanical - in fact Microsoft Access has a
Wizard to do it for you automatically - but there are three factors
which need some intelligence to avoid unnecessary complications.
One-to-one relations
Normalization will never generate two tables with a one-to-one
relationship between them. There is no theoretical reason to split a
single entity like this, with some fields in a record of one table and
the rest in the matching record of another table, but you might want to
split a table into two for practical reasons in order to:
- reduce the number of fields in a table and meet some limit in the programming language.
- store large and rarely-used fields in a separate table so that you do not waste time transferring and processing them during routine operations.
- separate sensitive information from commonly-used fields so that the tables can be stored and backed up separately (see the sketch after this list).
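One possible sketch of such a split, with invented names and types: the two tables share the same primary key, which gives the one-to-one relationship, and the bulky or sensitive fields live in the second table:

    -- Routine fields used in everyday queries
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name       VARCHAR(50),
        Town       VARCHAR(50)
    );

    -- Large or sensitive fields in a one-to-one companion table
    -- that can be stored and backed up separately
    CREATE TABLE CustomerPrivate (
        CustomerID INTEGER PRIMARY KEY REFERENCES Customers(CustomerID),
        CardNumber VARCHAR(20),
        Photograph BLOB    -- large, rarely-used field
    );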
Many-to-many relations
A many-to-many relationship cannot be implemented directly in a
relational database and has to be represented by an intermediate table
that records the many links between the two tables. Sometimes this
intermediate table will hold just the two foreign keys - for example
the Customer ID and the Product Code, where you have to represent many
customers each ordering many products and each product being ordered by
many customers. Sometimes more information will be held, for example the
discount offered on this product when ordered by this customer.
Normalization will tell you that Customer and Product information should
be separated from the original flat Orders table because every customer
places many orders and every product is included in many orders.
Normalization will not tell you how best to implement the link between
these tables.
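A sketch of that intermediate table, using the Customer ID and Product Code mentioned above plus an invented Discount column for the extra information carried on the link:

    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name       VARCHAR(50)
    );

    CREATE TABLE Products (
        ProductCode VARCHAR(10) PRIMARY KEY,
        Description VARCHAR(50)
    );

    -- One row per link between a customer and a product
    CREATE TABLE CustomerProducts (
        CustomerID  INTEGER REFERENCES Customers(CustomerID),
        ProductCode VARCHAR(10) REFERENCES Products(ProductCode),
        Discount    DECIMAL(4,2),   -- information that belongs to the link itself
        PRIMARY KEY (CustomerID, ProductCode)
    );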
Denormalization
Normalized tables store data as efficiently and as safely as possible
without duplication but you may need to denormalize in order to:
- make reporting easier for non-technical staff
- create a Data Warehouse
- improve the speed of execution
- reduce the length of an SQL statement and comply with language restrictions.
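As a rough sketch of the first of these reasons (all names are invented, and the CREATE TABLE ... AS SELECT syntax varies between database systems): the customer's name is copied into a pre-joined reporting table so that non-technical staff can query a single table without writing joins:

    -- Normalized: every report needs a join between
    --   Orders(OrderID, CustomerID, OrderDate) and Customers(CustomerID, Name)

    -- Denormalized reporting table: the join is done once, up front,
    -- and the customer name is deliberately duplicated on every order row
    CREATE TABLE OrderReport AS
    SELECT o.OrderID,
           o.OrderDate,
           c.Name AS CustomerName
    FROM   Orders o
    JOIN   Customers c ON c.CustomerID = o.CustomerID;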
Remember that the theory of normalization was developed at a time when a
megabyte of disk storage space cost over £100. Prices now are less
than a tenth of a penny a megabyte and there is no longer the pressure for
such extreme efficiency. Processor time has also become very much cheaper
and developer time is now much more expensive than either.
As an extreme example, you could normalize an address table and realise
that the same county name occurred many times. According to the rules you
would separate the county names into their own table with a CountyID key
in the address table. But you would then realise that the town names also
repeat and they need their own table too. As do the street names and
the postcodes and the telephone numbers. The rules have no discretion
and no common sense. If a single postcode occurs twice in the main table
then it has been repeated and so postcodes must be stored in a separate
table.
The normalized database now consists of county, town, district, street,
house name, house number, postcode and telephone tables. This might
save disk space but you would need an awkward and inefficient query to
join these eight tables every time that you wanted to generate a customer
address. This normalization is correct in theory but totally wrong in
practice.
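To make that cost concrete, the query below sketches what every single address lookup would involve in the over-normalized design; all table and column names are invented for illustration:

    SELECT hnum.HouseNumber, hn.HouseName, s.StreetName, d.DistrictName,
           t.TownName, c.CountyName, pc.Postcode, tel.TelephoneNumber
    FROM   Addresses    a
    JOIN   HouseNumbers hnum ON hnum.HouseNumberID = a.HouseNumberID
    JOIN   HouseNames   hn   ON hn.HouseNameID     = a.HouseNameID
    JOIN   Streets      s    ON s.StreetID         = a.StreetID
    JOIN   Districts    d    ON d.DistrictID       = a.DistrictID
    JOIN   Towns        t    ON t.TownID           = a.TownID
    JOIN   Counties     c    ON c.CountyID         = a.CountyID
    JOIN   Postcodes    pc   ON pc.PostcodeID      = a.PostcodeID
    JOIN   Telephones   tel  ON tel.TelephoneID    = a.TelephoneID
    WHERE  a.CustomerID = 123;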