Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL and fuzzy comparison

Let's assume we have a table of People (name, surname, address, SSN, etc).

We want to find all rows that are "very similar" to specified person A. I would like to implement some kind of fuzzy logic comparation of A and all rows from table People. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules on surname, 5 rules on address)

The question is Which of the following 2 approaches would be better and why?

  1. Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A. This approach may include using soundex, sim metric etc.

  2. Implement one or more simplier SELECT statements, that returns less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside database) to get "very similar" rows. So fuzzy comparation would be implemented in my favorit programming language.

Table People should have up to 500k rows, and I would like to make about 500-1000 queries like this a day. I use MySQL (but this is yet to be considered).

like image 858
running.t Avatar asked Apr 03 '13 23:04

running.t


People also ask

What is a fuzzy search in SQL?

A technique of finding the strings that match a pattern approximately (rather than exactly). Users / Reviewers often capture names inaccurately.

What is fuzzy matching used for?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.

What is fuzzy search in database?

A fuzzy search query searches for character sequences that are not only the same but similar to the query term. Use the tilde symbol (~) at the end of a term to do a fuzzy search. For example, the following query finds documents that include the terms analytics, analyze, analysis, and so on.

Is there a match function in SQL?

MATCH can be used only with graph node and edge tables, in the SELECT statement as part of WHERE clause.


1 Answers

I don't really think there is a definitive answer because it depends on information not available in the question. Anyway, too long for a comment.

DBMSes are good at retrieving information according to indexes. It does not make sense to have a db server wasting time in heavy computations unless it is dedicated for this specific purpose (as answered by @Adrian).

Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.

If the computations are minor, all could be done on the server. Else, pull it off into the client system.

The disadvantage of the second approach lies in the amount of data traveling from the server to the client and the number of connections to establish. So, typically it is a compromise between computation and data transfer in the server. A balance to be achieved depending on the specificities of the fuzzy rules.

Edit: I've seen in a comment that you are almost sure to have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e., try to have all code that is related together, not spreading it between systems (and languages).

like image 62
koriander Avatar answered Sep 22 '22 02:09

koriander