Finding the best matched word in a large vocabulary

I have a pandas DataFrame with two columns, Potential Word and Fixed Word. The Potential Word column contains words in different languages, both misspelled and correctly spelled, and the Fixed Word column contains the correct word corresponding to each Potential Word.

Below I have shared some sample data:

Potential Word    Fixed Word
Exemple           Example
pipol             People
pimple            Pimple
Iunik             unique

My vocab DataFrame contains 600K unique rows.
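
For concreteness, here are the samples above as a DataFrame (a minimal sketch of the structure; the real frame has 600K rows):

import pandas as pd

df = pd.DataFrame({
    'Potential Word': ['Exemple', 'pipol', 'pimple', 'Iunik'],
    'Fixed Word': ['Example', 'People', 'Pimple', 'unique'],
})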

My Solution:

key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65
for each in df['Potential Word']:
    # match is a function that returns a similarity value of two strings
    match_value = match(each, key)
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
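
For reference, here is a minimal stand-in for match (an assumption based on difflib; my real implementation is not shown):

import difflib

def match(word_a, word_b):
    # Similarity ratio in [0, 1]; case-folded so 'Exemple' can match 'Example'
    return difflib.SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()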

Problem

The problem with my code is that it takes a lot of time to fix each word, because the loop runs through the whole large vocab list. When a word is missing from the vocab, it takes almost 5 or 6 seconds to fix a sentence of 10~12 words. The match function performs decently, so the objective is to optimize the lookup.

I need an optimized solution. Please help me here.

Tarequzzaman Khan asked Feb 09 '21 08:02

2 Answers

From the perspective of Information Retrieval (IR), you need to reduce the search space. Matching the given_word (as key) against all Potential Words is definitely inefficient. Instead, you should match against a reasonable number of candidates.

To find such candidates, you need to index Potential Words and Fixed Words.

import os

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

# The index directory must exist before create_in is called
os.makedirs("indexdir", exist_ok=True)

ix = create_in("indexdir", Schema(
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True)
))
writer = ix.writer()
# Spacing out the characters makes each letter its own token,
# which approximates character-level matching
writer.add_document(potential='E x e m p l e', fixed='Example')
writer.add_document(potential='p i p o l', fixed='People')
writer.add_document(potential='p i m p l e', fixed='Pimple')
writer.add_document(potential='I u n i k', fixed='unique')
writer.commit()

With this index, you can search for candidates.

from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    # SimpleParser ORs the character tokens, so partial overlaps also match
    results = searcher.search(SimpleParser('potential', ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)

The output is

<Hit {'fixed': 'People', 'potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple', 'potential': 'p i m p l e'}>

Now, you can match the given_word against only a few candidates, instead of all 600K.
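
For illustration, here is a minimal sketch of that final step (best_fixed_word and top_n are hypothetical names; it reuses ix from above and the match function and threshold from the question):

from whoosh.qparser import SimpleParser

def best_fixed_word(given_word, top_n=10, match_threshold=0.65):
    # Space out the characters so the query tokens line up with the index
    query = SimpleParser('potential', ix.schema).parse(' '.join(given_word))
    best_value, best_word = 0, ''
    with ix.searcher() as searcher:
        for hit in searcher.search(query, limit=top_n):
            candidate = hit['potential'].replace(' ', '')  # undo the spacing
            value = match(candidate, given_word)
            if value > best_value and value > match_threshold:
                best_value, best_word = value, hit['fixed']
    return best_word

With the toy index above, best_fixed_word('pipol') should return 'People'.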

It is not perfect; however, this is an inevitable trade-off, and it is how IR essentially works. Try this with different numbers of candidates.

ghchoi answered Oct 17 '22 14:10


I won't change much of your implementation, since I believe some iteration over the list of potential words is required for each word. My intention here is not to optimize the match function itself but to leverage multiple threads to search in parallel.

import concurrent.futures
import time

import pandas as pd

# Replace this dataframe with yours for testing
df = pd.DataFrame({'Potential Word': ["a", "b", "c"], "Fixed Word": ["a", "c", "b"]})

# Replace this with your match function
def match(w1, w2):
    # Simulate some work happening here
    time.sleep(1)
    return 1

# This is mostly your function itself.
# The index is used to recreate the sentence from the returned values.
def matcher(idx, given_word):
    key = given_word
    glob_match_value = 0
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        # match is a function that returns a similarity value of two strings
        match_value = match(each, key)
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    # Returned only after the full scan; potential_fixed_word stays ''
    # when nothing clears the threshold, which you might want to change
    return idx, potential_fixed_word


sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a thread pool executor.
# You can increase or decrease max_workers based on your machine.
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures = executor.map(matcher, range(len(sentence.split())), sentence.split())

# Joining the fixed words back in input-sentence order
out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
print(out_sentence)
print(time.time() - start)

Please note that the runtime of this will depend upon:

  1. The maximum time taken by a single match call
  2. The number of words in the sentence
  3. The number of worker threads (TIP: try using as many workers as there are words in the sentence; see the sketch below)
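
Following that tip, a minimal variant (the sizing is an assumption, not part of the timed run above) that gives every word its own worker:

words = sentence.split()

# One worker per word, so every matcher call can start immediately
with concurrent.futures.ThreadPoolExecutor(max_workers=len(words)) as executor:
    futures = executor.map(matcher, range(len(words)), words)

Note that threads help here because time.sleep releases the GIL; if your real match is CPU-bound pure Python, a ProcessPoolExecutor may parallelize better.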
nimish answered Oct 17 '22 16:10