I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list worked, which isn't particularly important here. Anyway, now the punctuation isn't being removed and, quite frankly, I'm stumped as to why.
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list
Any pointers on why it's not working would be great. I'm no expert in Python, so it's probably something ridiculously stupid. Thanks.
One of the easiest ways to remove punctuation from a string in Python is to use the str.translate() method. The translate method typically takes a translation table, which we can build using the .maketrans() method.
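A minimal sketch of the translate approach, assuming Python 3 (the question's code is Python 2, where the table is built differently). The helper name strip_punct_and_digits is made up for illustration; string.punctuation and string.digits are standard-library constants covering ASCII characters only.

```python
import string

# Translation table whose third argument lists the characters to delete:
# every ASCII punctuation character and every ASCII digit.
table = str.maketrans("", "", string.punctuation + string.digits)

def strip_punct_and_digits(text):
    # Hypothetical helper: returns a new string, the input is not modified.
    return text.translate(table)

print(strip_punct_and_digits("Hello, world! 123"))
```

Because strings are immutable, the result has to be stored somewhere (for example back into the list) for the change to take effect.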
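Use regex to Strip Punctuation From a String in Python
The regex pattern [^\w\s] matches every character that is neither a word character (letter, digit, or underscore) nor whitespace, which covers most punctuation; each match can then be replaced with an empty string. Note that underscores count as word characters and are kept.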
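As a rough illustration of that pattern, assuming Python 3 and a hypothetical helper named strip_punctuation:

```python
import re

# Matches any single character that is neither a word character
# (letter, digit, underscore) nor whitespace.
non_word = re.compile(r"[^\w\s]")

def strip_punctuation(text):
    # Hypothetical helper: replace every match with the empty string.
    return non_word.sub("", text)

print(strip_punctuation("Hello, world! (test)"))
print(strip_punctuation("snake_case 42?"))
```

Digits and underscores survive this pattern, so it does not remove numbers on its own.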
Removing Numbers
Numbers can also be removed from text using regular expressions (regex). This step can be combined with the one above so that both are done in a single step.
remove_numbers(“007 Not sure@ if this % was #fun!
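The snippet above is cut off, so the following is only a guess at what such a helper might look like, assuming Python 3. Both function bodies and the name remove_punct_and_numbers are assumptions for illustration, not the original page's implementation.

```python
import re

digits = re.compile(r"\d")
# Single pattern for the combined step: punctuation OR a digit.
punct_or_digit = re.compile(r"[^\w\s]|\d")

def remove_numbers(text):
    # Assumed behaviour: delete every digit, keep everything else.
    return digits.sub("", text)

def remove_punct_and_numbers(text):
    # Hypothetical combined helper: one pass removes both kinds of character.
    return punct_or_digit.sub("", text)

print(remove_numbers("Room 101, floor 3!"))
print(remove_punct_and_numbers("Room 101, floor 3!"))
```

Whitespace is left untouched, so removed tokens can leave double or trailing spaces behind.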
Change
for word in word_list:
    word = punctuation.sub("", word)
to
word_list = [punctuation.sub("", word) for word in word_list]
Assignment to word in the for-loop above simply rebinds this temporary variable to a new string. It does not alter word_list.
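A small side-by-side sketch of the difference, assuming Python 3 and a made-up three-word list; the pattern is the one from the question.

```python
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = ["hello,", "world!", "42nd"]

# Loop from the question: `word` is rebound on each pass,
# but nothing is ever written back into the list.
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)

# List comprehension: builds a new list of the cleaned strings.
cleaned = [punctuation.sub("", word) for word in word_list]

print(unchanged)
print(cleaned)
```

Assigning the comprehension back to word_list, as in the suggested change, replaces the old list with the cleaned one.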
You're not updating your word list. Try
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)
Remember that although word starts off as a reference to the string object in word_list, assignment rebinds the name word to the new string object returned by the sub method. It doesn't change the originally referenced object.
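The in-place version can be sketched as follows, assuming Python 3 and a made-up word list. The extra name alias is only there to show that the same list object is being modified, rather than a new list being created.

```python
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = ["hello,", "world!", "42nd"]
alias = word_list  # second name bound to the same list object

# Writing to word_list[i] stores the new string in the list itself,
# so the change is visible through every reference to that list.
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

print(alias)
```

This differs from the list comprehension mainly in that any other references to the original list see the cleaned words too.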