
Is there a way to boost matching performance when doing string matching in Python?

I have a very large dictionary which stores large numbers of English sentences and their Spanish translations. When given a random English sentence, I intend to use Python's fuzzywuzzy library to find its closest match in the dictionary. My code:

from fuzzywuzzy import process
sentencePairs = {'How are you?':'¿Cómo estás?', 'Good morning!':'¡Buenos días!'}
query = 'How old are you?'
match = process.extractOne(query, sentencePairs.keys())[0]
print(match, sentencePairs[match], sep='\n')

In a real-life scenario, the sentencePairs dictionary would be very large, with at least one million items, so fuzzywuzzy takes a long time to return a result even when python-Levenshtein is installed for speedup. Is there a better way to achieve good performance? My goal is to get the result within a few seconds, or even in real time.

asked Jan 25 '23 by wbzy00


2 Answers

Ways to improve the performance

Fuzzy Matching using the Levenshtein Distance will never be super fast, but there are a couple of things in your code you can optimise:

  1. When you pass a string and a list to process.extractOne, it preprocesses these strings by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're reusing the same English→Spanish mapping each time, you should do this preprocessing once, ahead of time.

  2. Even when using python-Levenshtein, FuzzyWuzzy is not well optimised in many places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface, but is mostly written in C++ and comes with additional algorithmic improvements that make it a lot faster.

  3. Internally, process.extractOne uses fuzz.WRatio to compare the strings by default, which is a combination of multiple string-matching algorithms. Selecting a faster algorithm, e.g. by passing scorer=fuzz.ratio to process.extractOne, improves performance. Keep in mind, however, that this changes the way your strings are compared, so depending on your data you might not want to do this.

Implementation making use of 1 and 2

from rapidfuzz import process, utils
# english sentences are already lower cased
# and without special characters like question marks
sentencePairs = {'how are you':'¿Cómo estás?', 'good morning':'¡Buenos días!'}
query = 'How old are you?'
match, _ = process.extractOne(
   utils.default_process(query),
   sentencePairs.keys(),
   processor=None)
print(match, sentencePairs[match], sep='\n')

Implementation making use of 1, 2 and 3

from rapidfuzz import process, utils, fuzz
# english sentences are already lower cased
# and without special characters like question marks
sentencePairs = {'how are you':'¿Cómo estás?', 'good morning':'¡Buenos días!'}
query = 'How old are you?'
match, _ = process.extractOne(
   utils.default_process(query),
   sentencePairs.keys(),
   processor=None,
   scorer=fuzz.ratio)
print(match, sentencePairs[match], sep='\n')

Benchmarks

To provide some timing comparisons, I generated a million sentences:

import string
import random
random.seed(18)
sentencePairs = {
    ''.join(random.choice(string.ascii_lowercase + string.digits)
       for _ in range(15)
    ): "spanish text"
    for _ in range(1000000)
}
query = 'How old are you?'

The following table shows how long the different solutions take on my machine:

| Implementation                           | Runtime        |
|------------------------------------------|----------------|
| Your current implementation              | 18.98 seconds  |
| Implementation making use of 1 and 2     | 1.4 seconds    |
| Implementation making use of 1, 2 and 3  | 0.4 seconds    |
answered Jan 27 '23 by maxbachmann


There may be a better solution, but off the top of my head I can think of partitioning.

You can create 26 different dictionaries, one for each letter of the English alphabet, and load each dictionary with the keys that start with the corresponding letter, e.g. adict, bdict, ... zdict. So hdict would hold every key starting with h, like "How are you?".

That way, you only need to query the dictionary that matches the query's first letter.
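A minimal sketch of that bucketing idea (using the standard library's difflib as a stand-in matcher so the example is self-contained; fuzzywuzzy's process.extractOne would work the same way on each bucket):

```python
import difflib
from collections import defaultdict

sentencePairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}

# Build one bucket per starting letter, lower-cased.
buckets = defaultdict(list)
for key in sentencePairs:
    buckets[key[0].lower()].append(key)

query = 'How old are you?'
# Only the bucket matching the query's first letter is searched.
candidates = buckets.get(query[0].lower(), [])
match = difflib.get_close_matches(query, candidates, n=1, cutoff=0.5)[0]
print(match, sentencePairs[match], sep='\n')
```

Note that this assumes the closest match always begins with the same letter as the query; a correct match whose first letter differs would be missed entirely.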

answered Jan 27 '23 by Alim Shaikh