I have a list of ~50,000 strings (titles) and a list of ~150 words to remove from those titles wherever they appear. My code so far is below. The final output should be the list of 50,000 strings with all instances of the 150 words removed. What is the most efficient (performance-wise) way of doing this? My code runs, albeit very slowly.
excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    #remove all non-alphanumeric characters
    s = re.sub('[^0-9a-zA-Z]+', ' ', titles[k])
    #remove extra white space
    s = re.sub(r'\s+', ' ', s).strip()
    #lowercase
    s = s.lower()
    titles_alpha.append(s)
    #remove any excluded words
    for i in range(len(excludes)):
        titles_excl.append(titles_alpha[k].replace(excludes[i], ''))
print titles_excl
A lot of the performance overhead of regular expressions comes from compiling them. Moving the compilation out of the loop should give you a considerable improvement:
pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    #remove all non-alphanumeric characters
    s = pattern1.sub(' ', titles[k])
    #remove extra white space
    s = pattern2.sub(' ', s).strip()
Some tests with wordlist.txt from here:
import re

def noncompiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = re.sub('[^0-9a-zA-Z]+', ' ', titles[k])
        #remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        #remove extra white space
        s = pattern2.sub('', s)
In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop
In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop
To remove the "bad words" from your excludes list, you should, as @zsquare suggested, create a single joined regex, which is most likely the fastest you can get.
def with_excludes():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        #remove extra white space
        s = pattern2.sub('', s)
        #remove bad words
        s = excludes_regex.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop
You can take this approach one step further by just compiling a master regex:
def master():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    nonalpha = '[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))
    for k in range(len(titles)):
        #remove non-alphanumerics, whitespace and bad words in one pass
        s = master_regex.sub('', titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
You can gain some more speed by avoiding the explicit Python loop and using a list comprehension:
result = [master_regex.sub('',item) for item in titles]
In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
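Putting those pieces together, here is a minimal self-contained sketch of the master-regex approach. The sample titles and excludes are made up for illustration, and the `re.IGNORECASE` flag is my addition (standing in for the asker's `.lower()` step); note that, like `master()` above, substituting with `''` joins adjacent words together:

```python
import re

# hypothetical sample data standing in for the 50,000 titles / 150 excludes
titles = ["Hello, World!!", "Boo-hoo: a love story", "Data   Science 101"]
excludes = ["love", "boo"]

nonalpha = '[^0-9a-zA-Z]+'     # any run of non-alphanumeric characters
whitespace = r'\s+'            # any run of whitespace
badwords = '|'.join(excludes)  # alternation of the excluded words

# one combined regex: compiled once, applied once per title
master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]),
                          re.IGNORECASE)

result = [master_regex.sub('', title) for title in titles]
print(result)
```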
Note: The data generation step:
def baseline():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop
One way to do this would be to dynamically create a regex of your excluded words and replace them in the list.
Something like:
excludes_regex = re.compile('|'.join(excludes))
titles_excl = []
for title in titles:
    titles_excl.append(excludes_regex.sub('', title))
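One caveat with a plain `'|'.join(excludes)`: each word matches as a substring (e.g. "ass" would also strip the middle of "class"), and any regex metacharacters in the exclude list would be interpreted. A defensive variant is sketched below; the `\b` word boundaries and `re.escape` are my additions, not part of the original answer, and they assume the excluded words consist of word characters:

```python
import re

excludes = ["love", "boo", "ass"]  # hypothetical exclude list

# escape each word and anchor it on word boundaries so only whole
# words are removed, never substrings of longer words
excludes_regex = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in excludes) + r')\b')

titles = ["a lovely love story", "classy boo"]
# "lovely" and "classy" survive; standalone "love" and "boo" are removed
# (any leftover whitespace would be cleaned up by the whitespace pass)
titles_excl = [excludes_regex.sub('', title) for title in titles]
print(titles_excl)
```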