
Removing punctuation/numbers from text problem

Tags:

python

nltk

I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list worked (not particularly important). Now the punctuation isn't being removed, and quite frankly I'm stumped as to why.

import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list

Any pointers on why it's not working would be great. I'm no expert in Python, so it's probably something ridiculously stupid. Thanks.

Alex asked Apr 01 '11 11:04





2 Answers

Change

for word in word_list:
    word = punctuation.sub("", word)

to

word_list = [punctuation.sub("", word) for word in word_list]    

Assigning to word inside the for-loop only rebinds that temporary loop variable to a new string. It does not alter word_list.
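Here is a minimal sketch of the difference. It uses a made-up sample list instead of the asker's file, and the names unchanged and cleaned are only for illustration. Unlike the Python 2 code in the question, it should also run on Python 3.

```python
import re

# Same pattern as in the question: punctuation and digits to strip.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# Hypothetical sample data standing in for the words read from the file.
word_list = ['hello,', 'world!', 'abc123', '(ok)']

# Original loop: rebinding the loop variable leaves the list untouched.
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)  # snapshot: still contains the punctuation

# Fix: build a new list from the substituted words.
cleaned = [punctuation.sub("", word) for word in word_list]
```

Note that an entry made up entirely of punctuation or digits (for example '123') becomes an empty string rather than disappearing, so you may want to filter those out afterwards.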

unutbu answered Oct 16 '22 14:10



You're not updating your word list. Try

for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

Remember that although word starts off as a reference to the string object in word_list, assignment rebinds the name word to the new string object returned by the sub method. It doesn't change the originally referenced object, and because Python strings are immutable, nothing could change that object in place anyway.
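To illustrate, here is a small sketch of the indexed assignment on made-up data. The sample words and the names alias and same_object are assumptions for the demo, not part of the original code.

```python
import re

# Same pattern as in the question: punctuation and digits to strip.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

word_list = ['one.', 'two2', 'three?']  # hypothetical sample data
alias = word_list                        # second name for the same list object

# Assigning through the index stores the new string in the list itself,
# so every name bound to that list sees the change.
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

same_object = alias is word_list  # still one list, updated in place
```

This matters if other code holds a reference to the same list. Rebinding word_list to a newly built list would leave such references pointing at the old, uncleaned list, whereas the indexed assignment updates the one list they all share.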

Martin Stone answered Oct 16 '22 14:10
