
python re.sub with a list of words to find

Tags: python, regex, list

I am not very familiar with regular expressions, but I am trying to iterate over a list and use re.sub to remove multiple items from a large block of text held in the variable first_word.

I use re.sub to remove tags first and this works fine, but I next want to remove all the strings in the exclusionList variable and I am not sure how to do this.

Thanks for the help. Here is the code that raises the exception:

exclusionList = ['+','of','<ET>f.','to','the','<L>L.</L>']

for a in range(0, len(exclusionList)):
      first_word = re.sub(exclusionList[a], '',first_word)

And the exception:

    first_word = re.sub(exclusionList[a], '',first_word)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 245, in _compile
    raise error, v # invalid expression
error: nothing to repeat
English Grad asked Jun 10 '12 12:06


1 Answer

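The plus sign is a quantifier in regex meaning 'one or more repetitions of the preceding item'. E.g., x+ matches one or more repetitions of x. A pattern that consists of only + has nothing before it to repeat, which is the error you are seeing. If you want to find and replace actual + signs, you need to escape the character, using a raw string so the backslash is preserved: re.sub(r'\+', '', string). So change the first entry in your exclusionList to r'\+'. Note that the . in '<ET>f.' and '<L>L.</L>' is also a metacharacter (it matches any character), so those entries should be escaped as well, or you can pass every entry through re.escape().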
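As a minimal sketch of the point above (the sample string is made up for illustration and is not from the question), a bare + is rejected as a pattern, while an escaped one matches a literal plus sign:

```python
import re

sample = "1 + 2 of the <ET>f. rest"  # hypothetical sample text

# A bare '+' has nothing before it to repeat, so it is not a valid pattern.
try:
    re.compile('+')
    bare_plus_ok = True
except re.error:
    bare_plus_ok = False

# Escaping makes '+' a literal character; the raw string keeps the backslash intact.
no_plus = re.sub(r'\+', '', sample)

# re.escape builds the escaped form for you.
escaped = re.escape('+')
```

Here bare_plus_ok ends up False, and no_plus is the sample text with the plus sign removed.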

You can also eliminate the for loop, like this:

exclusions = '|'.join(exclusionList)
first_word = re.sub(exclusions, '', first_word)

The pipe symbol | indicates alternation in regex, so x|y|z matches x or y or z.
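The joined pattern still treats characters such as + and . as operators, so one way to keep the one-line version safe is to escape each entry before joining. A sketch, assuming first_word holds some text to clean (the value below is invented for the example):

```python
import re

exclusionList = ['+', 'of', '<ET>f.', 'to', 'the', '<L>L.</L>']
first_word = "<L>L.</L> + <ET>f. of to the word"  # hypothetical input

# Escape every entry so it is matched literally, then join the entries with '|'.
exclusions = '|'.join(re.escape(item) for item in exclusionList)

# One pass removes every occurrence of any entry.
first_word = re.sub(exclusions, '', first_word)
```

This removes the listed strings wherever they occur, including inside longer words (the 'to' in 'tomato', for instance), and it leaves the surrounding whitespace in place.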

Junuxx answered Nov 15 '22 00:11