I have an Oracle database that, like many, has a table containing biographical information, on which I would like to search by name in a "natural" way.
The table has forename and surname fields and, currently, I am using something like this:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
This works, but it can be very slow because the indices on this table obviously can't be used with a leading wildcard. Also, users will usually be searching for people based on names they're given over the phone -- including a huge number of non-English names -- so it would be nice to do some phonetic analysis too.
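To illustrate the kind of phonetic matching I mean, Oracle's built-in soundex function maps common spelling variants of a name to the same code:
select soundex('Smith') smith,
       soundex('Smythe') smythe
  from dual;
-- both return S530, so "Smythe" would match a phonetic search for "Smith"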
As such, I have been experimenting with Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
This has the advantage of using the Soundex algorithm as well as full-text indices, so it should be a little more efficient. (Although my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:
Firstly, the text indices need to be refreshed in some meaningful way. Using on commit would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database, so it requires some thinking about... (one possible approach is sketched after this list).
The results that are returned by Oracle aren't sorted very naturally; I'm not really sure about this score function. For example, my development data shows "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith".
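On the first point, one option would be to sync the indices on a schedule rather than on commit; a minimal sketch, assuming the usual Oracle Text privileges (execute on ctxsys.ctx_ddl):
begin
  -- process DML queued against the text indices since the last sync
  ctx_ddl.sync_index('forenameFTX');
  ctx_ddl.sync_index('surnameFTX');
end;
/
This could be run from a scheduler job every few minutes, trading a little staleness for commit-time speed.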
I'm thinking that removing the leading wildcard might improve performance without degrading the results since, in real life, you would never search for a chunk from the middle of a name. Otherwise, though, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach than what I'm doing/considering now?
Thanks :)
I have come up with a solution which works pretty well, following the suggestions in the comments -- particularly @X-Zero's suggestion of creating a table of Soundexes. (In my case, I can create new tables, but altering the existing ones is not allowed!)
So, my process is as follows:
Create a new table with columns ID, token, sound and position; with the primary key over (ID, sound, position) and an additional index over (ID, sound). (A possible DDL sketch is given after the PL/SQL block below.)
Go through each person in the biographical table:
Concatenate their forename and surname.
Change the codepage to us7ascii, so accented characters are normalised; this is because the Soundex algorithm doesn't work with accented characters.
Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.
Like so:
declare
  nameString varchar2(82);
  token      varchar2(40);
  posn       integer;
  cursor myNames is
    select id,
           forename||' '||surname person_name
      from mypeople;
begin
  for person in myNames
  loop
    -- strip non-alphabetic characters, collapse whitespace runs, then
    -- transliterate to us7ascii so accented letters fall back to their
    -- base characters (convert(), unlike utl_i18n.escape_reference(),
    -- yields plain letters rather than character references like &#xe2;)
    nameString := trim(
                    convert(
                      regexp_replace(
                        regexp_replace(person.person_name,'[^[:alpha:]]',' '),
                        '\s+',' '),
                      'us7ascii')
                  )||' ';
    posn := 1;
    -- walk the space-delimited tokens, recording each one with its
    -- Soundex and its position within the full name
    while nameString is not null
    loop
      token := substr(nameString,1,instr(nameString,' ') - 1);
      insert into personsearch values (person.id,lower(token),soundex(token),posn);
      nameString := substr(nameString,instr(nameString,' ') + 1);
      posn := posn + 1;
    end loop;
  end loop;
end;
/
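For reference, the search table itself might be declared something like this (the column sizes are my assumptions, based on the block above):
create table personsearch (
  id       integer      not null,  -- matches mypeople.id
  token    varchar2(40) not null,  -- lowercased name token
  sound    varchar2(4)  not null,  -- soundex(token), always a 4-character code
  position integer      not null,  -- token's place within the full name
  constraint personsearch_pk primary key (id, sound, position)
);
create index personsearch_ix on personsearch (id, sound);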
So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3), and those three entries, with their Soundex, get inserted into personsearch along with their ID.
To search against this table, the search string is tokenised in the same way; each token is then matched on its Soundex, and the results are ordered by position and Levenshtein distance (ld) from the original search for each token, in turn. This query, for example, will search against two tokens (i.e., a pre-tokenised search string):
with searchcriteria as (
  select 'john'  token1,
         'smith' token2
    from dual)
select alpha.id,
       mypeople.forename||' '||mypeople.surname
  from personsearch alpha
  join mypeople
    on mypeople.id = alpha.id
  join personsearch beta
    on beta.id = alpha.id
   and beta.position > alpha.position
  join searchcriteria
    on 1 = 1
 where alpha.sound = soundex(searchcriteria.token1)
   and beta.sound = soundex(searchcriteria.token2)
 order by alpha.position,
          ld(alpha.token,searchcriteria.token1),
          beta.position,
          ld(beta.token,searchcriteria.token2),
          alpha.id;
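Note that ld is not a built-in; a minimal stand-in, assuming Oracle 11g's utl_match package is available, could be:
create or replace function ld(a in varchar2, b in varchar2)
  return integer deterministic
is
begin
  -- utl_match.edit_distance returns the Levenshtein (edit) distance
  return utl_match.edit_distance(a, b);
end;
/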
To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in each joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :) (A rough sketch of the statement-building loop follows.)
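Just to illustrate the shape of that dynamic statement -- this is a sketch, not the planned function, and executing it with a variable number of bind values would need dbms_sql rather than execute immediate:
declare
  type token_list is table of varchar2(40);
  tokens token_list := token_list('john','smith','terrance');  -- example input
  stmt   varchar2(4000);
begin
  -- one copy of the search table per token, chained on id and position
  stmt := 'select t1.id from personsearch t1';
  for i in 2 .. tokens.count loop
    stmt := stmt || ' join personsearch t' || i ||
            ' on t' || i || '.id = t1.id' ||
            ' and t' || i || '.position > t' || (i - 1) || '.position';
  end loop;
  stmt := stmt || ' where t1.sound = soundex(:b1)';
  for i in 2 .. tokens.count loop
    stmt := stmt || ' and t' || i || '.sound = soundex(:b' || i || ')';
  end loop;
  dbms_output.put_line(stmt);  -- the real function would execute this
end;
/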
As I say, this works pretty well: it returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s, returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:
The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem, however, is that a trigger needs to be put on the mypeople table to insert/update/delete tokens in the search table whenever the corresponding operation is performed on mypeople. This may slow the system down; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis (see the sketch after the next point).
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s (their Soundexes differ: C620 versus C623). A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...
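If the scheduled rebuild wins out over triggers, something like this could drive it nightly -- rebuild_personsearch here is a hypothetical procedure wrapping the tokenisation block above:
begin
  dbms_scheduler.create_job(
    job_name        => 'rebuild_personsearch_job',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin rebuild_personsearch; end;',
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',  -- nightly, out of hours
    enabled         => true);
end;
/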
Anyway, hope that helps :) Comments welcome!
EDIT: My full solution (write-up and implementation) is now here, using Metaphone and the Damerau-Levenshtein distance.