Duplicate names and addresses
Removing duplicates is one of the more difficult tasks
when you're working with lists of names and addresses. There
are three ways of doing this - four if we include the
simple-minded approach of leaving it all to the user:
- Let the user do it
- Help the user during data entry
- Give the user some semi-automatic tools
- Develop a fully-automatic solution
Whichever method you use, you will not get anywhere until you
have a
consistent layout
for your address data.
Different businesses have different views of duplicates so you
must get managerial support for your strategy. A business
sending out cheap mailshots to an unsophisticated audience can
afford duplicates. A business sending out samples of high-value
products will be more demanding.
Let the user do it
This can work with a good operator who is the sole user of a
small data set that changes slowly. If there are only a thousand
or so records then the user will become familiar with the names
and will notice if two names are similar.
You may need to provide some extra tools to modify other tables
in the database before a duplicate can be removed. For example, the
orders that were assigned to the duplicate entry must be reassigned
before the duplicate can be removed.
Help the user
Even the best user cannot be expected to remember all the entries
in a large table or in one that changes rapidly. Duplicates are
difficult to remove once they become established so the best time
to catch them is at the time of data entry. Show the user a list of
possible duplicates before committing this new record to the table.
A search on the following fields will pick up the majority of
potential duplicates:
- Surname + Initial
- Postcode
- Phone number
- email
If you can catch the duplicates before they get entered into the
table then you do not have to do any tidying up afterwards.
Semi-automatic tools
If you are checking for duplicates at the time of data entry then
you must do it quickly. The list of possible duplicates must appear
in less than a second, especially if your user is taking details
from a new customer by phone. You can build a tool that does a more
thorough search and include it on the maintenance menu.
Consider adding the following additional tests as a maintenance tool:
- Names that sound similar eg 'Philips' and 'Phillips'
- Compound surnames with spaces and hyphens removed
- Alternative first names eg 'Bill', 'Will', and 'William'
- Same surname at the same address
The Soundex algorithm generates an alphnumeric code based on the
pronunciation of the name and can be used for phonetic matches. It is
however not as helpful as it might seem. For example, 'Cawley' and
'Crawley' might easily be duplicates but they map to codes of 'C400' and
'C640' respectively and would not be picked up by Soundex. On the other
hand, 'Brighthelmstone' and 'Brigadoon' both map to a code of 'B623' and
so would be considered possible duplicates by a Soundex match.
If you do build such a housekeeping tool then it might also be a good
place to look for the predictable joke names such as 'M. Mouse' and
'B. Bunny'.
Automatic tools
It is very difficult to remove duplicates automatically. You
must have a rigid data format as a starting point and be confident
that names of towns, counties and countries are spelt consistently.
You will also have to write algorithms that can identify house names,
road names, and suburbs and extract these components of the addresses.
These routines must be robust enough to cope with addresses that range
from the minimal:
Batheaston House
Bath
BA1 1AA
to the baroque:
Flat 2a
"The Lodge"
The Batheaston House Inn
High House Lane
Off Bath Road
Batheaston
Near Bath
Bath and North East Somerset
Are these duplicates? Only a local could tell you.
|