 

Regex replace is taking too long on millions of documents, how can I make it faster?

I have documents like:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]

I have a synonym file such as:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}

I want to replace all synonyms, for which I wrote this code:

import re

# Pre-compile all the regexes once to save compilation time (credit: alec_djinn).
compiled_dict = {}
for term in synonyms:
    compiled_dict[term] = re.compile(r'\b' + re.escape(term) + r'\b')

for doc in documents:
    document = doc
    for term, pattern in compiled_dict.items():
        document = pattern.sub(synonyms[term], document)
    print(document)

Output:

I work on c programing.
I work on c programing.

But since the number of documents is a few million and the number of synonym terms is in the tens of thousands, this code is expected to take approximately 10 days to finish.

Is there a faster way to do this?

PS: I want to train a word2vec model on the output.

Any help is greatly appreciated. I was thinking of writing some Cython code and running it in parallel threads.

asked May 25 '17 by Vikash Singh

1 Answer

I have done string replacement jobs like this before, also for training word2vec models on very large text corpora. When the number of terms to replace (your "synonym terms") is very large, it can make sense to do string replacement using the Aho-Corasick algorithm instead of looping over many single string replacements. You can take a look at my fsed utility (written in Python), which might be useful to you.
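For concreteness, here is a minimal sketch of the Aho-Corasick approach, assuming the third-party pyahocorasick package (pip install pyahocorasick). The replace_synonyms helper and its crude word-boundary check are illustrative names of my own, not part of fsed:

import ahocorasick

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Build the automaton once, up front; every synonym key becomes a pattern.
automaton = ahocorasick.Automaton()
for term, replacement in synonyms.items():
    automaton.add_word(term, (term, replacement))
automaton.make_automaton()

def replace_synonyms(text):
    # Scan the text once; iter() yields (end_index, value) for every match.
    out = []
    last_end = 0
    for end_idx, (term, replacement) in automaton.iter(text):
        start_idx = end_idx - len(term) + 1
        # Skip matches that overlap one we already accepted (this keeps the
        # earliest-ending match; recent pyahocorasick versions also offer
        # iter_long() if you want longest-match semantics instead).
        if start_idx < last_end:
            continue
        # Crude word-boundary check standing in for regex \b; adjust to
        # your own tokenization needs.
        before_ok = start_idx == 0 or not text[start_idx - 1].isalnum()
        after_ok = end_idx + 1 == len(text) or not text[end_idx + 1].isalnum()
        if not (before_ok and after_ok):
            continue
        out.append(text[last_end:start_idx])
        out.append(replacement)
        last_end = end_idx + 1
    out.append(text[last_end:])
    return "".join(out)

for doc in ["I work on c programing.", "I work on c coding."]:
    print(replace_synonyms(doc))  # -> "I work on c programing." both times

The point of the design is that the automaton is built once, so each document costs a single left-to-right scan no matter how many synonym terms there are, instead of tens of thousands of separate regex passes per document.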

answered Oct 12 '22 by wildwilhelm