Fuzzy text searching in Oracle

Tags: sql, oracle, fuzzy-search

I have a large Oracle DB table with 600,000+ rows that contains the street names for a whole country. In my application, I take an address string as input and want to check whether specific substrings of it match one or more of the street names in the table, so that I can label that substring as a street name.

Clearly, this is a fuzzy text matching problem: there is only a small chance that the substring I query exactly matches a street name in the DB table, so I need some kind of fuzzy text matching approach. I am reading the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm, which explains the CONTAINS and CATSEARCH search operators. But these seem to be meant for more complex tasks, like searching for a given string in documents. I just want to do that for a single column of a table.

What do you suggest in this case? Does Oracle support this kind of fuzzy text matching query?

asked Aug 12 '14 07:08

Ufuk Can Bicici

2 Answers

The UTL_MATCH package contains functions for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein distance, might be a good place to start. Since one string is a substring of the other, it may help to compare the edit distance relative to the size of the strings.

--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
    --Rank edit ratios.
    select substring, address, edit_ratio
        ,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
    from
    (
        --Calculate edit ratio - edit distance relative to string sizes.
        select
            substring,
            address,
            (length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
        from
        (
            --Fake addresses (from http://names.igopaygo.com/street/north_american_address)
            select '526 Burning Hill Big Beaver District of Columbia 20041'   address from dual union all
            select '5206 Hidden Rise Whitebead Michigan 48426'                address from dual union all
            select '2714 Noble Drive Milk River Michigan 48770'               address from dual union all
            select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
            select '968 Iron Corner Wacker Arkansas 72793'                    address from dual
        ) addresses
        cross join
        (
            --Address substrings.
            select 'Michigan'           substring from dual union all
            select 'Not-So-Hidden Rise' substring from dual union all
            select '123 Fake Street'    substring from dual
        )
        order by substring, edit_ratio desc
    )
)
where edit_ratio_rank = 1
order by substring, address;

These results are not great, but hopefully this is at least a good starting point. It should work with any language, but you'll probably still want to combine it with some language- or locale-specific comparison rules.

SUBSTRING          ADDRESS                                                  EDIT_RATIO
---------          -------                                                  ----------
123 Fake Street    526 Burning Hill Big Beaver District of Columbia 20041   0.5333
Michigan           2714 Noble Drive Milk River Michigan 48770               1
Michigan           5206 Hidden Rise Whitebead Michigan 48426                1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426                0.5
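
As a possible shortcut, UTL_MATCH also has functions that return a normalized score from 0 to 100, so the ratio does not have to be computed by hand. The sketch below assumes a hypothetical table STREETS with a column STREET_NAME and a bind variable :candidate holding the address substring; the cutoff of 80 is an arbitrary value to tune, not a recommendation.

```sql
-- Hypothetical table STREETS(STREET_NAME); :candidate is the address substring.
-- Both functions return an integer from 0 (no match) to 100 (identical strings).
select street_name,
       utl_match.edit_distance_similarity(lower(street_name), lower(:candidate)) as edit_similarity,
       utl_match.jaro_winkler_similarity(lower(street_name), lower(:candidate))  as jw_similarity
from streets
-- Arbitrary cutoff; adjust it after looking at real matches.
where utl_match.edit_distance_similarity(lower(street_name), lower(:candidate)) >= 80
order by edit_similarity desc, jw_similarity desc;
```

Note that this evaluates the function for every row, so on 600,000+ rows it is probably worth narrowing the candidates first.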

answered Oct 22 '22 10:10

Jon Heller

You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic code for a text string (a letter followed by three digits). This can be used to find strings which sound similar and thus reduce the number of string comparisons.
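
A minimal sketch of that idea, assuming a hypothetical table STREETS with a column STREET_NAME and a bind variable :candidate for the address substring. SOUNDEX acts as a cheap pre-filter, and UTL_MATCH.EDIT_DISTANCE then ranks only the rows that survive it.

```sql
-- Hypothetical table STREETS(STREET_NAME); :candidate is the address substring.
-- A function-based index can let the optimizer find matching codes without a full scan.
create index streets_soundex_idx on streets (soundex(street_name));

-- Keep only the rows that sound like the input, then rank them by edit distance.
select street_name,
       utl_match.edit_distance(street_name, :candidate) as distance
from streets
where soundex(street_name) = soundex(:candidate)
order by distance;
```

Because the code keeps only the first letter and three digits, long street names that start the same way will tend to share a code, so expect the pre-filter to be coarse.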

Edited: If SOUNDEX is not suitable for your local language, you can search for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.

Example: A Turkish SOUNDEX is promoted here.
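
One way to arrange this, as a rough sketch: the application computes the phonetic key with whatever function it prefers and stores it in an extra column, so Oracle only has to do an indexed equality lookup. The table STREETS, the column PHONETIC_KEY, the index name and the bind variables are all assumptions for illustration.

```sql
-- Hypothetical schema: STREETS(STREET_NAME) plus an extra key column.
alter table streets add (phonetic_key varchar2(32));
create index streets_phonetic_key_idx on streets (phonetic_key);

-- Once per new table entry: the application supplies the precomputed key.
insert into streets (street_name, phonetic_key)
values (:street_name, :phonetic_key);

-- Once per query: compute the key for the input outside Oracle, then look it up.
select street_name
from streets
where phonetic_key = :candidate_key;
```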

To improve the matching quality, the street name spelling should be normalized as a first step. This could be done by applying a set of rules:

Simplified example rules:

Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
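
Some of these rules can be sketched directly in SQL. The query below assumes a hypothetical table STREETS with a column STREET_NAME and covers lowercasing, the suffix rules and the auxiliary words; sorting the words alphabetically is left out because it is awkward in plain SQL and is probably easier in application code.

```sql
-- Hypothetical table STREETS(STREET_NAME).
select street_name,
       trim(
         regexp_replace(
           regexp_replace(
             -- Pad with spaces and double the inner spaces so that every word is
             -- surrounded by its own pair of spaces; adjacent auxiliary words
             -- can then both be matched.
             ' ' || replace(
                      -- Lowercase, then strip a trailing "str.", "drv.", "ave." or " place".
                      regexp_replace(lower(trim(street_name)),
                                     '( *(str\.|drv\.|ave\.)| +place)$', ''),
                      ' ', '  ') || ' ',
             -- Drop auxiliary words that stand alone between spaces.
             ' (of|and|the) ', ''),
           -- Collapse the leftover runs of spaces.
           ' +', ' ')
       ) as normalized_name
from streets;
```

The same expression would have to be applied to the input substring, so that both sides of the comparison are normalized in the same way.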

answered Oct 22 '22 12:10

Axel Kemper

