I'm using the nltk library's movie_reviews corpus, which contains a large number of documents. My task is to measure predictive performance on these reviews with and without pre-processing of the data. The problem is that the lists documents and documents2 contain the same documents, and I need to shuffle them so that both lists end up in the same order. I cannot shuffle them separately, because each call to shuffle produces a different order. I need to shuffle both at once, in the same order, because I compare them at the end and the comparison depends on order. I'm using Python 2.7.
Example (in reality the strings are tokenized, but that is not relevant here):
    documents = [(['plot : two teen couples go to a church party , '], 'neg'),
                 (['drink and then drive . '], 'pos'),
                 (['they get into an accident . '], 'neg'),
                 (['one of the guys dies'], 'neg')]

    documents2 = [(['plot two teen couples church party'], 'neg'),
                  (['drink then drive . '], 'pos'),
                  (['they get accident . '], 'neg'),
                  (['one guys dies'], 'neg')]
And I need to get this result after shuffling both lists:
    documents = [(['one of the guys dies'], 'neg'),
                 (['they get into an accident . '], 'neg'),
                 (['drink and then drive . '], 'pos'),
                 (['plot : two teen couples go to a church party , '], 'neg')]

    documents2 = [(['one guys dies'], 'neg'),
                  (['they get accident . '], 'neg'),
                  (['drink then drive . '], 'pos'),
                  (['plot two teen couples church party'], 'neg')]
I have this code:
    import random

    import nltk
    from nltk.corpus import movie_reviews, stopwords

    def cleanDoc(doc):
        # Drop stopwords and very short tokens, lowercase, then stem.
        stopset = set(stopwords.words('english'))
        stemmer = nltk.PorterStemmer()
        clean = [token.lower() for token in doc
                 if token.lower() not in stopset and len(token) > 2]
        final = [stemmer.stem(word) for word in clean]
        return final

    # Raw documents: (list of words, category)
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]

    # Pre-processed documents, built in the same order as documents
    documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
                  for category in movie_reviews.categories()
                  for fileid in movie_reviews.fileids(category)]

    # random.shuffle(...)  <- here I need to shuffle documents and documents2
    #                         in the same order, or do it some other way
The syntax is random.sample(list, k), where k is the number of values to sample.
Python random.shuffle() method: shuffle() takes a mutable sequence, such as a list, and reorders its items in place. Note: this method changes the original list; it does not return a new list.
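To make the in-place behaviour concrete, here is a minimal sketch using only the standard library. The variable names (items, original, result, sampled) are made up for illustration. It also shows random.sample with k equal to the list length, which is one way to get a shuffled copy while leaving the input untouched.

```python
import random

items = ['a', 'b', 'c', 'd']
original = list(items)  # keep a copy so we can compare afterwards

# shuffle() reorders the list in place and returns None
result = random.shuffle(items)

# sample() with k == len(list) returns a new shuffled list
# and does not modify the list passed in
sampled = random.sample(original, len(original))

print(result)    # None
print(items)     # same elements as before, in some random order
print(original)  # unchanged
```

A common mistake is writing items = random.shuffle(items), which replaces the list with None.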
One approach to shuffling both data and labels in the same order: using the number of elements in your data, generate a random index order with the permutation() function, then use that index order to reorder both the data and the labels.
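The index-based idea can be sketched as follows. Note that this sketch swaps in the standard library (random.shuffle applied to a list of indices) rather than numpy's permutation(), so it runs without extra dependencies; the names data, labels and idx are placeholders, not anything from the question.

```python
import random

data = ['a', 'b', 'c', 'd']
labels = [1, 2, 3, 4]

# Build one random ordering of the positions 0..n-1
idx = list(range(len(data)))
random.shuffle(idx)

# Apply the SAME ordering to both lists so pairs stay aligned
shuffled_data = [data[i] for i in idx]
shuffled_labels = [labels[i] for i in idx]

print(shuffled_data)
print(shuffled_labels)
```

Because both lists are indexed with the same idx, the element at position i in shuffled_data still corresponds to the element at position i in shuffled_labels. Keeping idx around also lets you apply the same order to a third list later.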
You can use numpy.random.shuffle(). For a multi-dimensional array, this function shuffles only along the first axis.
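As a small illustration of the first-axis behaviour, assuming numpy is installed: rows of a 2-D array are reordered, but the contents of each row stay together. The array pairs below is invented for the example, with the second column always equal to the first plus 10 so that alignment is easy to check.

```python
import numpy as np

# Each row is a pair that must stay together: [value, value + 10]
pairs = np.array([[0, 10],
                  [1, 11],
                  [2, 12],
                  [3, 13]])

# Shuffles in place along the first axis: rows move, columns do not
np.random.shuffle(pairs)

print(pairs)
```

This is why stacking related values into the rows of one array keeps them aligned after shuffling. For lists of Python tuples like documents, the plain-list approaches are usually simpler.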
You can do it as:
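    import random

    a = ['a', 'b', 'c']
    b = [1, 2, 3]

    # Pair up the elements, shuffle the pairs, then split them again
    c = list(zip(a, b))
    random.shuffle(c)
    a, b = zip(*c)  # note: a and b are now tuples, not lists

    print(a)
    print(b)

    [OUTPUT] (one possible result; the order is random)
    ('a', 'c', 'b')
    (1, 3, 2)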
Of course, this was an example with simpler lists, but the adaptation will be the same for your case.
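For instance, applied to the small documents / documents2 example from the question, a sketch might look like this. The intermediate name paired is made up for illustration, and the results are converted back to lists because zip(*...) yields tuples.

```python
import random

documents = [
    (['plot : two teen couples go to a church party , '], 'neg'),
    (['drink and then drive . '], 'pos'),
    (['they get into an accident . '], 'neg'),
    (['one of the guys dies'], 'neg'),
]
documents2 = [
    (['plot two teen couples church party'], 'neg'),
    (['drink then drive . '], 'pos'),
    (['they get accident . '], 'neg'),
    (['one guys dies'], 'neg'),
]

# Pair each raw document with its cleaned counterpart, then shuffle the pairs
paired = list(zip(documents, documents2))
random.shuffle(paired)

# Split back into two lists that share the same new order
documents, documents2 = (list(group) for group in zip(*paired))

print(documents)
print(documents2)
```

This relies on both lists having the same length and the same initial order, which holds in the question because both are built with identical loops over categories and file ids.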
Hope it helps. Good Luck.