Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is full text search suitable for searching people's names?

I have a user table with a single column for a person's name:

CREATE TABLE [dbo].[Users]
(
    Id bigint NOT NULL,
    Name nvarchar(80) NOT NULL,
    PRIMARY KEY CLUSTERED (Id ASC)
)

The Name column can contain either a full name or just the first name or anything really (separated by spaces). In order to implement a search on Name, I would like to utilize SQL's full-text search, but not sure if it's suitable for searching names/nicknames and not actual words. Also the question is - which language do I choose when creating the FT index on Name?

Any other considerations?

Thank you.

like image 203
niaher Avatar asked Nov 18 '12 09:11

niaher


People also ask

What is full-text search used for?

Full-text search is meant to search large amounts of text. For example, a search engine will use a full-text search to look for keywords in all the web pages that it indexed. The key to this technique is indexing.

What is full-text search example?

A Full-Text query allows you to search for words inside text data. For example, with a Full-Text query you can search for the word “werewolf” in the title of any of our movies. The basic unit of Full-Text search is often referred to as a token rather than a word.

What is the use of full-text search in SQL Server?

Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

What is full-text search vs LIKE?

Like uses wildcards only, and isn't all that powerful. Full text allows much more complex searching, including And, Or, Not, even similar sounding results (SOUNDEX) and many more items.


2 Answers

At first sight, I would recommend using the LIKE operator rather than a full-text query.

Make sure you search case insensitive and probably accent insensitive. This can be achieved by the setting the correct collation on either the server, database, the table column or in your query. In the query, this is done by something like:

SELECT *
FROM [dbo].[Users]
WHERE Name LIKE '%niaher%' COLLATE SQL_Latin1_General_CP1_CI_AI

If you use a full-text index, you get all sorts of features, like verb stemming and thesaurus, see Linguistic Components and Language Support in Full-Text Search, which you do not need, when searching in a list of names. By the way, these features are language dependent and that's why you specify a language on the full-text index.

Application of a stoplist you might even want to avoid. At least I would, since in Dutch many surnames start with articles and/or prepositions: "Rembrandt van Rijn". "van" would be in a Dutch stoplist for sure and prevent any match on a search term that contains "van".

If you run into performance issues, it might be useful to try a full-text index and search using CONTAINS with a simple term.

SELECT *
FROM [dbo].[Users]
WHERE CONTAINS(Name, 'niaher')

Please note that full-text indexes are updated asynchronously.

like image 63
R. Schreurs Avatar answered Sep 19 '22 10:09

R. Schreurs


It seems that if you want to search multi-part names, full-text search is the easiest and most appropriate approach (please correct me if I'm wrong). The other alternative being LIKE '%query%', however it has too many disadvantages:

  • Terrible performance, since it does index scan
  • Order of terms matters, e.g. - searching for "John Smith" and "Smith John" will return different results.
  • It disregards word boundaries, e.g. - searching for "Ann" will also retrieve "Joanna" and "Danny", which aren't useful matches.

So I went ahead and implemented a full-text search. My queries look something like this:

SELECT * FROM Users WHERE CONTAINS(Name, '"John*"')

The only slight difficulty is that I had to convert user query (John) into a CONTAINS-friendly query ("John*"). To do that, I implemented this method in my UserRepository:

/// <summary>
/// Converts user-entered search query into a query that can be consumed by CONTAINS keyword of SQL Server.
/// </summary>
/// <example>If query is "John S Ju", the result will be "\"John*\" AND \"S*\" AND \"Ju*\"".</example>
/// <param name="query">Query entered by user.</param>
/// <returns>String instance.</returns>
public static string GetContainsQuery(string query)
{
    string containsQuery = string.Empty;

    var terms = query.Split(new[] { ' ' }, StringSplitOptions.None);

    if (terms.Length > 1)
    {
        for (int i = 0; i < terms.Length; i++)
        {
            string term = terms[i].Trim();

            // Add wildcard term, e.g. - "term*". The reason to add wildcard is because we want
            // to allow search by partially entered name parts (partially entered first name and/or
            // partially entered last name, etc).
            containsQuery += "\"" + term + "*\"";

            // If it's not the last term.
            if (i < terms.Length - 1)
            {
                // We want all terms inside user query to match.
                containsQuery += " AND ";
            }
        }

        containsQuery = containsQuery.Trim();
    }
    else
    {
        containsQuery = "\"" + query + "*\"";
    }

    return containsQuery;
}

Hope this helps anyone stumbling into the same issue.

PS - I wrote a blogpost documenting this.

like image 31
niaher Avatar answered Sep 20 '22 10:09

niaher