
How to iterate through the words of a sentence string in Python?

Assume I have a string text = "A compiler translates code from a source language". I want to do two things:

  1. I need to iterate through each word and stem it using the NLTK library. The function for stemming is PorterStemmer().stem_word(word), which takes the word as its argument. How can I stem each word and get back the stemmed sentence?

  2. I need to remove certain stop words from the text string. The list of stop words is stored in a text file (space separated):

    stopwordsfile = open('c:/stopwordlist.txt','r+')
    stopwordslist=stopwordsfile.read()
    

    How can I remove those stop words from text and get a cleaned new string?

ChamingaD asked May 08 '12 20:05



2 Answers

I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:

You want to use str.split() to split the string into words, and then stem each word:

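from nltk.stem import PorterStemmer
stemmer = PorterStemmer()  # build the stemmer once and reuse it
for word in text.split(" "):
    stemmer.stem_word(word)  # named stem() in newer NLTK releases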
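
The loop above computes each stem but does not keep the result. As a minimal sketch, the stems can be collected in a list instead. The names `toy_stem` and `stem_words` are hypothetical, and `toy_stem` is only a crude stand-in (it drops one trailing "s") so that the example runs without NLTK installed; it is not the Porter algorithm.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drops one trailing 's'.

    This is NOT the Porter algorithm; it only keeps the sketch runnable
    without NLTK installed.
    """
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def stem_words(text, stem=toy_stem):
    """Return the stem of every whitespace-separated word, in order."""
    # split() with no argument also copes with tabs and repeated spaces.
    return [stem(word) for word in text.split()]


text = "A compiler translates code from a source language"
print(stem_words(text))

# With NLTK installed, pass the real stemmer instead. Assumption: your
# version names the method stem(); older releases used stem_word().
#   from nltk.stem import PorterStemmer
#   print(stem_words(text, PorterStemmer().stem))
```

Passing the stemming function as a parameter keeps the splitting logic independent of whichever NLTK method name your installed version provides.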

As you want to get a string of all the stemmed words together, it's trivial to then join these stems back together. To do this easily and efficiently we use str.join() and a generator expression:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))

Edit:

For your other problem:

with open("/path/to/file.txt") as f:
    words = set(line.strip() for line in f)  # strip the trailing newline from each word

Here we open the file using the with statement (the best way to open files: it closes them correctly even if an exception is raised, and it is more readable) and read the contents into a set. We use a set because we don't care about the order of the words or about duplicates, and because membership tests on it will be more efficient later. I am presuming one word per line; if that isn't the case and the words are comma-separated or whitespace-separated, then using str.split() as we did before (with appropriate arguments) is probably a good plan.

stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)

Here we use the if clause of a generator expression to skip stems that are in the set of words we loaded from the file. Membership checks on a set are O(1) on average, so this should be relatively efficient.
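
For the space-separated file described in the question, a minimal sketch of loading and filtering might look like the following. The function names `load_stop_words` and `remove_stop_words` are hypothetical, and the case-insensitive comparison is an assumption about what is wanted, not something the question specifies.

```python
def load_stop_words(path):
    """Read a whitespace-separated stop word file into a set."""
    with open(path) as f:
        # read().split() handles spaces, tabs and newlines alike.
        return set(f.read().split())


def remove_stop_words(text, stop_words):
    """Return text without the words found in stop_words."""
    # Assumption: matching should ignore case, so "A" is dropped when
    # "a" is listed (this relies on the list itself being lowercase).
    return " ".join(
        word for word in text.split() if word.lower() not in stop_words
    )


text = "A compiler translates code from a source language"
print(remove_stop_words(text, {"a", "from"}))
# prints: compiler translates code source language
```

In real use the second argument would come from `load_stop_words(...)` with the path to your own stop word file.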

Edit 2:

To remove the words before they are stemmed, it's even simpler:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)

The removal of the given words is simply:

filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]
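
Whether the stop words are removed before or after stemming can change the result, because a stemmer may alter a stop word so that it no longer matches the list. The sketch below illustrates this with a hypothetical `toy_stem` (it only drops one trailing "s" and is not the Porter algorithm); `filter_then_stem` and `stem_then_filter` are likewise illustrative names. How a real stemmer treats a given stop word is worth checking against your own list.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer (drops one trailing 's'); not Porter."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def filter_then_stem(text, stop_words, stem=toy_stem):
    """Drop stop words first, then stem what is left."""
    return " ".join(stem(w) for w in text.split() if w not in stop_words)


def stem_then_filter(text, stop_words, stem=toy_stem):
    """Stem every word first, then drop stems found in stop_words."""
    stems = (stem(w) for w in text.split())
    return " ".join(s for s in stems if s not in stop_words)


text = "this compiler translates code"
stop_words = {"this"}
print(filter_then_stem(text, stop_words))  # compiler translate code
print(stem_then_filter(text, stop_words))  # thi compiler translate code
```

In the second call the stop word survives as "thi" because it was stemmed before the membership check.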
Gareth Latty answered Oct 25 '22 21:10



To go through each word in the string:

for word in text.split():
    PorterStemmer().stem_word(word)

Use the string's join method (as recommended by Lattyware) to concatenate the pieces into one string.

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
Gergely Sipkai answered Oct 25 '22 23:10
