
Fuzzy searching a large set of data efficiently?

I have a database in the following format:

"filename": {
        "url": "base64 of Zlib compressed URL to decrease size of file",
        "date": "Date added to database",
        "size": "Size of file in URL(not always present)"
}

The format is tree-like. For example, a directory entry:

"dirname": {
        "url": "base64 of Zlib compressed URL to decrease size of file",
        "date": "Date added to database",
        [...files and directories in this directory...]
}

can contain more files and directories.
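
For reference, the "url" field can be produced and read back with Python's standard zlib and base64 modules. A minimal sketch of the round trip (the helper names are illustrative only, not from my actual code):

import base64
import zlib

def encode_url(url):
    # compress the URL with zlib, then base64-encode the compressed bytes
    return base64.b64encode(zlib.compress(url.encode("utf-8"))).decode("ascii")

def decode_url(encoded):
    # reverse the steps: base64-decode, then zlib-decompress
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")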


Goal

I'm trying to fuzzy search just the names and return the URL/date(/size) of entries in the database. It currently has 6.5M strings with an average length of about 36 characters. Duplicate names are present in the database.


What I've tried

I figured it would be faster to load the data into RAM first. I only have 8GB on my laptop, so to lower the usage I save the data in list format, where the URL is compressed with Zlib to further decrease RAM usage. The format is something like this:

[["file or directory name", "zlib compressed url", "date", "size if exists"], ...]

which comes to around 3GB currently. Then I slice the list into 20 pieces, wrap each slice in an iterator, and pass each iterator to a function that runs in a separate Process:

results = manager.list()  # shared list from multiprocessing.Manager()
processes = []
# in a loop that slices the data into n pieces (20 currently):
    p = multiprocessing.Process(
        target=self.slice_search,
        args=(results, name, iter(self.minimal_data[start:i]), function_to_use, min_score),
    )
    processes.append(p)
    p.start()

the "function_to_use" is currently fuzz.QRatio from fuzzywuzzy, "slice_search" is a function that appends the data into a shared list if the result of "function_to_use" on the string is more than a certain threshold.
The results are in stored in a similar format:

[["score", "file or directory name", "zlib compressed url", "date", "size if exists"], ...]

and are sorted after the search is over, then saved to a file in human-readable format (the URLs are also decompressed).
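
Roughly, "slice_search" does something like this (a simplified sketch; the parameter names mirror the args tuple above, everything else is simplified away):

def slice_search(results, name, data_iter, function_to_use, min_score):
    # scan one slice of the data; keep entries whose name scores
    # at or above the threshold against the query
    for entry in data_iter:
        score = function_to_use(name, entry[0])
        if score >= min_score:
            results.append([score] + entry)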


Problem

With all of this, the search still takes about 20-30 seconds. I truly believe there's a better way, but I don't have the knowledge required to make it happen. My final goal is to get it to work in under 10 seconds. I would appreciate any help or direction you can point me to.

asked May 06 '20 by Xosrov




1 Answer

For this answer I will work on the following data:

import string
import random
random.seed(18)
str_options = string.ascii_uppercase + string.ascii_lowercase + string.digits + ' '
query = ''.join(random.choice(str_options) for _ in range(30))
choices = [''.join(random.choice(str_options) for _ in range(30)) for _ in range(6500000)]

I will not use any multiprocessing for now, but this can be done in a parallel way as well.

  1. This is roughly what your current solution looks like:
from fuzzywuzzy import fuzz
results = []
for choice in choices:
    if fuzz.QRatio(query, choice) >= 80:
        results.append(choice)

As you mentioned, fuzzywuzzy already uses python-Levenshtein for the Levenshtein calculations, which is pretty optimised. However, before calculating, fuzzywuzzy checks whether both strings are empty or equal so it can return early without computing the Levenshtein distance. While this sounds like a good idea, it really is not, since checking whether two strings are equal requires iterating over both strings in full. It is a lot better to remove the common prefix and suffix before the Levenshtein calculation; this speeds it up so that, e.g., for equal strings it runs in linear time. It is a bit slower when the strings are exactly the same, but when working on fuzzy data that is very unlikely to be the case. This first solution runs in about 55 seconds on my machine.
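
To illustrate the idea (RapidFuzz does this trimming internally in C++; this helper is only a rough Python sketch, not the actual implementation):

import os

def trim_common_affixes(s1, s2):
    # strip the common prefix
    prefix_len = len(os.path.commonprefix([s1, s2]))
    s1, s2 = s1[prefix_len:], s2[prefix_len:]
    # strip the common suffix (= common prefix of the reversed strings)
    suffix_len = len(os.path.commonprefix([s1[::-1], s2[::-1]]))
    if suffix_len:
        s1, s2 = s1[:-suffix_len], s2[:-suffix_len]
    # equal strings reduce to two empty strings in linear time
    return s1, s2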

  2. This replaces fuzzywuzzy with RapidFuzz (I am the author), an MIT-licensed reimplementation of FuzzyWuzzy that is mostly implemented in C++ and performs a lot better, since it fixes quite a few performance issues in FuzzyWuzzy:
from rapidfuzz import fuzz
results = []
for choice in choices:
    if fuzz.QRatio(query, choice) >= 80:
        results.append(choice)

This only requires about 18 seconds on my machine, so it is already about a 3x improvement. Another issue is that fuzz.QRatio preprocesses both strings, lowercasing them and removing some unwanted characters. While this generally makes sense, it means that the query here gets preprocessed 6.5 million times instead of once.

  3. This only preprocesses the query once:
from rapidfuzz import fuzz, utils
results = []
processed_query = utils.default_process(query)
for choice in choices:
    processed_choice = utils.default_process(choice)
    if fuzz.ratio(processed_query, processed_choice, score_cutoff=80):
        results.append(choice)

It takes 14 seconds on my machine. This shows that you might want to store your filenames in a preprocessed form as well, so you do not have to preprocess them when searching either (this would get it down to around 11 seconds). At this point the main time requirement is calculating the Levenshtein distance, which is an O(m*n) operation, so it pays to reduce the number of strings on which it has to be computed. A quick check that RapidFuzz already applies by default is comparing the lengths of the two strings: when their lengths differ greatly, the required ratio can never be reached, and this check runs in constant time since the lengths are already known anyway. In my test case here it never applies, though, since all strings have a length of 30. When an even faster solution is required, you can still calculate this on multiple cores; see the sketch below. You can use the C++ version, RapidFuzz-cpp, as well (it does not yet have all the features of the Python version, but enough to implement this too).
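
As a sketch of the multi-core variant (the chunking and worker count here are arbitrary choices for illustration, not a recommendation), the loop from snippet 3 can be spread over a standard multiprocessing.Pool:

import multiprocessing
from rapidfuzz import fuzz, utils

def search_chunk(args):
    processed_query, chunk = args
    # each worker scans its own chunk of the choices
    return [choice for choice in chunk
            if fuzz.ratio(processed_query, utils.default_process(choice),
                          score_cutoff=80)]

def parallel_search(query, choices, workers=4):
    processed_query = utils.default_process(query)
    chunk_size = -(-len(choices) // workers)  # ceiling division
    chunks = [(processed_query, choices[i:i + chunk_size])
              for i in range(0, len(choices), chunk_size)]
    # on Windows, call this from under an `if __name__ == "__main__":` guard
    with multiprocessing.Pool(workers) as pool:
        # flatten the per-chunk result lists into one list of matches
        return [match for part in pool.map(search_chunk, chunks) for match in part]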

The pure C++ version of RapidFuzz still needs a little work, and especially documentation, but it can be used in the following way:

#include <string>
#include <vector>

using rapidfuzz::string_utils::default_process;
using rapidfuzz::fuzz::CachedRatio;

std::string query("example");
std::vector<std::string> choices{"example", "example2", "example3"};

// preprocess the query once and cache it inside the scorer
std::string processed_query = default_process(query);
std::vector<std::string> results;
CachedRatio<std::string> scorer(processed_query);
for (const auto& choice : choices) {
    std::string processed_choice = default_process(choice);

    // the second argument is the score cutoff (80 here)
    if (scorer.ratio(processed_choice, 80)) {
        results.push_back(choice);
    }
}
answered Sep 27 '22 by maxbachmann