Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using REPLACE in WHERE clause to check spelling permutations - MS SQL

I have a table like:

| id | lastname | firstname |
|  1 | doe      | john      |
|  2 | oman     | donald    |
|  3 | o'neill  | james     |
|  4 | onackers | sharon    |

Essentially, users are going to be searching by the first letters of the last name.

I want to be able to return results that contain and don't contain punctuation from the database. For instance, when a user searches for: on

I want to return both: o'neill, onackers

I want someone to be able to search "o, on, oneill, o neill, etc" to get o'neill.

So the best way to do this seems to take the lastname column value and have two permutations of it searched in the WHERE clause with an OR. One where any special characters are replaced with the _ in SQL, and one where all non-alpha chars (including spaces) are gone.

I figure I can use the underscore in the SQL replace to keep the one space available.

I'm having a little trouble with the WHERE clause. I'd prefer to do this with a simple REPLACE rather than creating a regex function if possible. If that's a no-go though, I understand:

@last_name (this is the nvarchar input)

SELECT id, lastname, firstname
FROM people
WHERE ((REPLACE(people.lastname, '[^A-Za-z]', '_') like @last_name + '%')
OR (REPLACE(people.lastnname,'[^A-Za-z ]', '') like @last_name + '%'))
ORDER BY lastname

I'm pretty sure the replace part has to be on the other side of the LIKE. I'm messing up the structure but need some help.

I am Using MSSQL Server 2005.

Thank you so much in advance.

UPDATE

It seems like I have two options:

  1. Create a regular expression function using CLR (excuse me if I'm saying this wrong, I'm new to it)
  2. Create extra columns on the table or create a new "fuzzyTable" with the cleaned up last names.

The database gets updated once a night. I have actually already begun the new table approach, as it was what I was originally going to do. However, I'm beginning to think it's smarter to add the "fuzzy" columns to the main table and then on the nightly update to add the adjusted lastnames to the new / updated rows.

Stack Overflow: Which approach is better? User-defined REGEX function I can use in the SQL, and thus avoid extra columns? Or adding the extra column or two to the table? Or a new table?

like image 854
Steph Rose Avatar asked Jun 06 '11 13:06

Steph Rose


People also ask

Can I use Replace in the WHERE clause?

How can MySQL REPLACE() function be used with WHERE clause? As we know that WHERE clause is used to put condition/s in MySQL query and MySQL returns result set based on those conditions. Similarly when we use REPLACE() function with WHERE clause, the result set will depend upon the conditions provided.

Can having replace WHERE in SQL?

It is possible to replace HAVING with WHERE, but there is no benefit in doing so.

How do I find and replace a word in SQL?

On the Edit menu, point to Find and Replace, and then click Quick Find to open the dialog box with find options, but without replace options. On the Edit menu, point to Find and Replace, and then click Quick Replace to open the dialog box with both find options and replace options.


1 Answers

Depending on how complex your scenario can get, this will be lots of work, and slow too. But there's a more flexible approach. Consider something like this, referred to as initialTable:

| id | lastname | firstname |
|  1 | o'malley | josé      |
|  2 | omállèy  | dònáld    |
|  3 | o'neill  | jámès     |
|  4 | onackers | sharon    |

Maybe a bit much, but it illustrates the general problem. I had to implement a "fuzzy" search for our intranet website based on character data that looked very similar - there's many accents in french or spanish names or street addresses for example.

What I did was define a function that performed all replacements for a given string, for example (pseudocode):

function string replacestuff(string input)
{
  input = replace(input, "è", "e");
  input = replace(input, "é", "e");
  input = replace(input, "ò", "o");
  input = replace(input, "ó", "o");
  input = replace(input, "'", "");
  ...
  return input;
}

Using this conversion function, create a second table fuzzyTable that has the following content:

| id | lastname | firstname |
|  1 | omalley  | jose      |
|  2 | omalley  | donald    |
|  3 | oneill   | james     |
|  4 | onackers | sharon    |

Now, assume you'll get an input string for your search of josè. This can't be found in either table. What you'll have to do is this:

declare @input varchar(50)
declare @input_mod varchar(50)
set @input = 'josè'
set @input_mod = replacestuff(@input)

SELECT id FROM initialTable WHERE firstname like @input OR firstname like @input_mod
UNION
SELECT id FROM fuzzyTable WHERE firstname like @input OR firstname like @input_mod
GROUP BY id

(Of course, you'd have to add % to make LIKE work.) The key here is to modify your input search string using the replacement function; this way you'll get a match if searching for against a content of because both come down to se when being processed by the replacement function.

You could even do a two-level search; first check only the unmodified string against the proper table and then with the statement shown above do a fuzzy search if the user says so.

This is a very flexible approach and can handle all sorts of stuff, like finding german letters ä, ö, ü, ß by using two-letter expressions ae, oe, ue, ss. The disadvantage is that you'll have to keep duplicates of some data, and change those duplicates within fuzzyTable as the initialTable (or the replacement function) changes. In our current use case, the intranet database gets updated once a night, so it's not a problem.

EDIT

You need to be aware that, using this, in some cases you'll get false positives. For example, we're using this for an employee search, and if you've got a dutch name spelled Hoek, you'd also find this name searching for Hök, because in german the replacement for ö would be oe. This could be solved using country-aware replacement functions, but we never took the concept this far. Depending on your input data this is more or less academic, for our use case I can't remember anyone complaining.

The main reason why we came up with this approach in the first place was that some of the data we had to work with was riddled with spelling errors, ie. in french many vowels were accented the wrong way around, but still we needed to deliver a result.

like image 137
takrl Avatar answered Nov 14 '22 23:11

takrl