First Name Variations in a Database

Tags:

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.

Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.

As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.

Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!

Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.

The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?

Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!

266

asked Feb 22 '09 15:02

Joseph Ferris

2 Answers

In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).

In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.

105

answered Sep 24 '22 12:09

Tomalak

The term you are looking for is Hypocorism:

http://en.wikipedia.org/wiki/Hypocorism

And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.

I would go with a structure like this:

create table given_names (
  id int primary key,
  name text not null unique
);

create table hypocorisms (
  id int references given_names(id),
  name text not null,

  primary key (id, name)
);

insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');

Then you could write a function/sproc to normalize a name:

normalize_given_name('Bill'); --returns William

One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)

answered Sep 23 '22 12:09

Neil McGuigan

Related questions
                            
                                What database for crawler/scraper?
                            
                                How can I get enum possible values in a MySQL database using php? [duplicate]
                            
                                truncated LISTAGG string [duplicate]
                            
                                unsafe assembly permission was denied on object 'server' database 'master' [closed]
                            
                                SUM() all results (no group by clause)
                            
                                How to get part of the string that matched with regular expression in Oracle SQL
                            
                                SQL Server, ISABOUT, weighted terms
                            
                                SQL Azure - One session locking entire DB for Update and Insert
                            
                                How to optimize query that has the same subquery twice?
                            
                                SQLite, check if TEXT field has any alphabetical chars in it
                            
                                subquery uses ungrouped column "i.date_time" from outer query
                            
                                Aggregate continuous ranges of dates
                            
                                SQL Remove All Constraints Azure Friendly
                            
                                SQL Number - Row_Number() - Allow Repeating Row Number
                            
                                Hibernate Check Annotation
                            
                                How to inner-join in Excel (eg. using VLOOKUP)
                            
                                Merging Complicated Tables
                            
                                How to disable elements in a grid
                            
                                Postgres INSERT INTO with SELECT ordering
                            
                                Trying to store XML content into SQL Server 2005 fails (encoding problem)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

First Name Variations in a Database

Tags:

algorithm

sql

sql-server

data-structures

database-design

Joseph Ferris

People also ask

2 Answers

Tomalak

Neil McGuigan

Recent Activity

Donate For Us