
Creating a list of every word from a text file without spaces, punctuation

Tags: python

I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is:

with open('screenplay.txt', 'r') as file:
    words = file.read().split()
print(words)

I think this splits all the words into a list, but I'm having trouble removing the extra characters, like commas and periods at the end of words. I also want to make capital letters lowercase (so that I can search in lowercase and have both capitalized and lowercase words show up). Any help would be fantastic :)

Tom F asked Aug 08 '13 20:08


2 Answers

Try the algorithm from https://stackoverflow.com/a/17951315/284795, i.e. split the text on whitespace, then trim punctuation from each word. This removes punctuation only from the edges of words, so apostrophes inside words such as we're are left intact.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might also want to call .lower() on each word, so that a lowercase search matches both capitalized and lowercase words.
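Putting the pieces together, here is a minimal sketch of the split, strip, and lowercase steps as one function. The name extract_words and the sample sentence are only for illustration, and the check that drops empty strings is an extra assumption for tokens made up entirely of punctuation (such as a standalone "--").

```python
import string

def extract_words(text):
    """Split on whitespace, trim edge punctuation, and lowercase each word."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were only punctuation, such as "--"
            words.append(word)
    return words

sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(extract_words(sample))
# ['oh', 'you', "can't", 'help', 'that', 'said', 'the', 'cat', "we're", 'all', 'mad', 'here']
```

For the screenplay, you would pass in the file contents instead of the sample, for example extract_words(open('screenplay.txt').read()).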

Colonel Panic answered Sep 21 '22 03:09


This is a job for regular expressions!

For example:

import re

# Read the whole file and lowercase it, so that later searches are case-insensitive.
with open('screenplay.txt', 'r') as file:
    text = file.read().lower()

# Replace every run of characters that is not a lowercase letter, a space,
# or an apostrophe with a single space:
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print(words)
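
One thing to watch with the substitution above: because apostrophes are kept everywhere, a quoted word like 'Oh, still comes out as 'oh with its leading quote mark. If that matters, a possible variation (not part of the original answer) is to match the words directly with re.findall instead of deleting everything else. The function name find_words is hypothetical, and the pattern assumes plain ASCII letters, so accented characters and digits would be dropped.

```python
import re

def find_words(text):
    """Return lowercase words, keeping apostrophes only inside a word."""
    # One or more letters, optionally followed by apostrophe + letters
    # (so "can't" and "we're" match whole, but a quote at the edge does not).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(find_words(sample))
# ['oh', 'you', "can't", 'help', 'that', 'said', 'the', 'cat', "we're", 'all', 'mad', 'here']
```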

Brionius answered Sep 24 '22 03:09
