I have a pandas dataframe that contains two columns named `Potential Word` and `Fixed Word`. The `Potential Word` column contains words of different languages, both misspelled and correctly spelled, and the `Fixed Word` column contains the correct words corresponding to `Potential Word`.
Below I have shared some sample data:
| Potential Word | Fixed Word |
|---|---|
| Exemple | Example |
| pipol | People |
| pimple | Pimple |
| Iunik | unique |
My vocab dataframe contains 600K unique rows.
My Solution:
```python
key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65

for each in df['Potential Word']:
    match_value = match(each, key)  # match is a function that returns a
                                    # similarity value of two strings
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
```
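For reference, the `match` implementation is not shown in the question; a minimal stand-in using Python's built-in `difflib` (purely an assumption about its behavior) could be:

```python
import difflib

def match(w1: str, w2: str) -> float:
    # Stand-in similarity function (assumption; the real match() is not shown).
    # Returns a ratio in [0, 1] based on the longest matching subsequences.
    return difflib.SequenceMatcher(None, w1.lower(), w2.lower()).ratio()
```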
Problem

The problem with my code is that it takes a lot of time to fix every word, because the loop runs through the whole vocab list for each word. When a word is missing from the vocab, it takes almost 5 or 6 seconds to fix a sentence of 10~12 words. The `match` function performs decently, so it is not the target of the optimization.

I need an optimized solution. Please help me here.
From the perspective of Information Retrieval (IR), you need to reduce the search space. Matching the `given_word` (as `key`) against all Potential Words is definitely inefficient.

Instead, you need to match against a reasonable number of candidates.

To find such candidates, you need to index the Potential Words and Fixed Words:
```python
import os

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

# The index directory must exist before create_in is called
os.makedirs("indexdir", exist_ok=True)

# stoplist=None disables the stop filter, so single-character tokens survive
ix = create_in("indexdir", Schema(
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True)
))

# Words are indexed with spaces between characters, so each character
# becomes a token and candidates can be retrieved by shared characters
writer = ix.writer()
writer.add_document(potential='E x e m p l e', fixed='Example')
writer.add_document(potential='p i p o l', fixed='People')
writer.add_document(potential='p i m p l e', fixed='Pimple')
writer.add_document(potential='l u n i k', fixed='unique')
writer.commit()
```
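For the real 600K-row dataframe, you would index every row rather than hard-coding documents. A sketch (assuming `df` is your vocab dataframe, using the same character-spacing trick as above):

```python
writer = ix.writer()
for potential, fixed in zip(df['Potential Word'], df['Fixed Word']):
    # ' '.join turns "pipol" into "p i p o l" so each character is a token
    writer.add_document(potential=' '.join(potential), fixed=fixed)
writer.commit()
```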
With this index, you can search for candidates:
```python
from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    # Query with the same character-spaced form used at index time
    results = searcher.search(SimpleParser('potential', ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)
```
The output is:

```
<Hit {'fixed': 'People', 'potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple', 'potential': 'p i m p l e'}>
```
Now, you can `match` the `given_word` against only a few candidates, instead of all 600K.

It is not perfect; however, this is an inevitable trade-off, and essentially how IR works. Try this with different numbers of candidates.
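Putting the pieces together, a sketch of the full lookup could retrieve a handful of candidates from the index and re-rank only those with your `match` function (the `limit` of 10 and the stripping of spaces are assumptions, chosen to fit the character-spaced index above):

```python
def fix_word(given_word, match_threshold=0.65, n_candidates=10):
    # Query with the same character-spaced form used at index time
    query = SimpleParser('potential', ix.schema).parse(' '.join(given_word))
    best_value, best_word = 0, ''
    with ix.searcher() as searcher:
        for hit in searcher.search(query, limit=n_candidates):
            # Undo the character spacing before scoring with match()
            candidate = hit['potential'].replace(' ', '')
            match_value = match(candidate, given_word)
            if match_value > best_value and match_value > match_threshold:
                best_value, best_word = match_value, hit['fixed']
    return best_word
```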
This answer doesn't change much of your implementation, as I believe it is somewhat required to iterate over the list of potential words for each word. My intention here is not to optimize the `match` function itself, but to leverage multiple threads to search in parallel.
```python
import concurrent.futures
import time
from concurrent.futures.thread import ThreadPoolExecutor
from typing import Any, Union, Iterator

import pandas as pd

# Replace your dataframe here for testing this
df = pd.DataFrame({'Potential Word': ["a", "b", "c"], "Fixed Word": ["a", "c", "b"]})

# Replace by your match function
def match(w1, w2):
    # Simulate some work is happening here
    time.sleep(1)
    return 1

# This is mostly your function itself
# Using index to recreate the sentence from the returned values
def matcher(idx, given_word):
    key = given_word
    glob_match_value = 0
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        match_value = match(each, key)  # match is a function that returns a
                                        # similarity value of two strings
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
            return idx, potential_fixed_word
    else:
        # The for-else runs when the loop finishes without returning a match
        # Handling default case, you might want to change this
        return idx, ""

sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a thread pool executor
# You can increase or decrease the max_workers based on your machine
executor: Union[ThreadPoolExecutor, Any]
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures: Iterator[Union[str, Any]] = executor.map(matcher, list(range(len(sentence.split()))), sentence.split())

    # Joining back the input sentence
    out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
    print(out_sentence)

print(time.time() - start)
```
Please note that the runtime of this will depend upon your machine, the number of workers, and the cost of each `match` call.
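One caveat worth noting: if your real `match` function is CPU-bound pure Python (unlike the sleeping stand-in above), the GIL may prevent threads from running it in parallel. Swapping in a process pool is one option to try; the `max_workers=8` below is an arbitrary assumption:

```python
# Processes sidestep the GIL for CPU-bound match() implementations.
# Note: matcher, match, and df must be defined at module level (picklable),
# and on Windows/macOS this needs an `if __name__ == "__main__":` guard.
with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
    futures = executor.map(matcher, range(len(sentence.split())), sentence.split())
    out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
```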