Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy text searching in Oracle

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.

Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.

What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?

like image 234
Ufuk Can Bicici Avatar asked Aug 12 '14 07:08

Ufuk Can Bicici


People also ask

What is fuzzy search in Oracle?

Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. Oracle has tools that can help - Enterprise Data Quality, for instance. But it's not always practical to bring in another tool.

What is a fuzzy search in SQL?

A technique of finding the strings that match a pattern approximately (rather than exactly). Users / Reviewers often capture names inaccurately.

What is fuzzy search example?

For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.

What is fuzzy search how fuzzy search works?

A fuzzy search searches for text that matches a term closely instead of exactly. Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term.


2 Answers

UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance relative to the size of the strings.

--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
    --Rank edit ratios.
    select substring, address, edit_ratio
        ,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
    from
    (
        --Calculate edit ratio - edit distance relative to string sizes.
        select
            substring,
            address,
            (length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
        from
        (
            --Fake addreses (from http://names.igopaygo.com/street/north_american_address)
            select '526 Burning Hill Big Beaver District of Columbia 20041'   address from dual union all
            select '5206 Hidden Rise Whitebead Michigan 48426'                address from dual union all
            select '2714 Noble Drive Milk River Michigan 48770'               address from dual union all
            select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
            select '968 Iron Corner Wacker Arkansas 72793'                    address from dual
        ) addresses
        cross join
        (
            --Address substrings.
            select 'Michigan'           substring from dual union all
            select 'Not-So-Hidden Rise' substring from dual union all
            select '123 Fake Street'    substring from dual
        )
        order by substring, edit_ratio desc
    )
)
where edit_ratio_rank = 1
order by substring, address;

These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.

SUBSTRING          ADDRESS                                                  EDIT_RATIO
---------          -------                                                  ----------
123 Fake Street    526 Burning Hill Big Beaver District of Columbia 20041   0.5333
Michigan           2714 Noble Drive Milk River Michigan 48770               1
Michigan           5206 Hidden Rise Whitebead Michigan 48426                1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426                0.5
like image 96
Jon Heller Avatar answered Oct 22 '22 10:10

Jon Heller


You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a numeric signature of a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.

Edited: If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.

Example: A Turkish SOUNDEX is promoted here.

To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:

Simplified example rules:

  1. Convert all characters to lowercase
  2. Remove "str." at the end of a name
  3. Remove "drv." at the end of a name
  4. Remove "place" at the end of a name
  5. Remove "ave." at the end of a name
  6. Sort names with multiple words alphabetically
  7. Drop auxiliary words like "of", "and", "the", ...
like image 31
Axel Kemper Avatar answered Oct 22 '22 12:10

Axel Kemper