I have documents like:
documents = [
    "I work on c programing.",
    "I work on c coding.",
]
I have a synonym file, such as:
synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}
I want to replace all synonyms, for which I wrote this code:
import re

# Pre-compile all regexes to save compilation time. Credits: alec_djinn
compiled_dict = {}
for term in synonyms:
    compiled_dict[term] = re.compile(r'\b' + re.escape(term) + r'\b')

for doc in documents:
    document = doc
    # One substitution pass per synonym term
    for term, regex in compiled_dict.items():
        document = regex.sub(synonyms[term], document)
    print(document)
Output:
I work on c programing.
I work on c programing.
But since there are a few million documents and tens of thousands of synonym terms, this code would take roughly 10 days to finish.
Is there a faster way to do this?
PS: I want to train a word2vec model on the output.
Any help is greatly appreciated. I was thinking of writing some Cython code and running it in parallel threads.
I have done string replacement jobs like this before, also for training word2vec models on very large text corpora. When the number of terms to replace (your "synonym terms") is very large, it can make sense to do string replacement using the Aho-Corasick algorithm instead of looping over many single string replacements. You can take a look at my fsed utility (written in Python), which might be useful to you.
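As a middle ground before reaching for a dedicated Aho-Corasick library, you can collapse all the per-term passes into a single pass per document by joining the terms into one alternation and resolving each match through the synonym dictionary. This is a minimal sketch, not your fsed utility; the inputs mirror the question's data, and sorting terms longest-first is an assumption to make overlapping keys resolve to the longest match:

```python
import re

# Small inputs mirroring the shapes in the question.
synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}
documents = [
    "I work on c programing.",
    "I work on c coding.",
]

# Build ONE pattern matching any synonym key, longest keys first so that
# overlapping keys resolve to the longest match.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(map(re.escape, sorted(synonyms, key=len, reverse=True)))
    + r')\b'
)

def canonicalize(doc):
    # Single pass per document: look up the replacement for whatever matched.
    return pattern.sub(lambda m: synonyms[m.group(0)], doc)

normalized = [canonicalize(doc) for doc in documents]
```

This turns "tens of thousands of `sub` calls per document" into one scan per document, at the cost of a larger compiled pattern; a true Aho-Corasick automaton scales better still as the term list grows.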