
Changing phrases to vectors with while function in Python

I would like to change the following phrases to vectors with sklearn:

Article 1. It is not good to eat pizza after midnight
Article 2. I wouldn't survive a day withouth stackexchange
Article 3. All of these are just random phrases
Article 4. To prove if my experiment works.
Article 5. The red dog jumps over the lazy fox

I have the following code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

n=0
while n < 5:
   n = n + 1
   a = ('Article %(number)s' % {'number': n})
   print(a)
   with open("LISR2.txt") as openfile:
     for line in openfile:
       if a in line:
           X=line
           print(vectorizer.fit_transform(X))

Which gives me the following error:

ValueError: Iterable over raw text documents expected, string object received.

Why does this happen? I expected it to work, because if I enter the phrases directly:

X=("It is not good to eat pizza","I wouldn't survive a day", "All of these")

print(vectorizer.fit_transform(X))

It gives me my desired vectors.

(0, 8)  1
(0, 2)  1
(0, 11) 1
(0, 3)  1
(0, 6)  1
(0, 4)  1
(0, 5)  1
(1, 1)  1
(1, 9)  1
(1, 12) 1
(2, 10) 1
(2, 7)  1
(2, 0)  1
Rafael Martínez asked Dec 03 '16 19:12



2 Answers

Look at the docs. It says CountVectorizer.fit_transform expects an iterable of strings (e.g. a list of strings). You are passing a single string instead.

This makes sense: fit_transform in scikit-learn does two things. First it learns a model (fit), then it applies that model to the data (transform). You want to build a matrix whose columns are the words in the vocabulary and whose rows correspond to the documents. To do that, it needs to see the whole corpus at once so that it knows the full vocabulary (all the columns).
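As a rough sketch of that idea, the snippet below collects all the phrases into a list first and calls fit_transform once on the whole list. The in-memory `lines` list is only a stand-in for the contents of LISR2.txt, and stripping the "Article N." prefix is an assumption about what should be vectorized, not something the question specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the lines of LISR2.txt (assumption: one article per line).
lines = [
    "Article 1. It is not good to eat pizza after midnight",
    "Article 2. I wouldn't survive a day withouth stackexchange",
    "Article 3. All of these are just random phrases",
    "Article 4. To prove if my experiment works.",
    "Article 5. The red dog jumps over the lazy fox",
]

# Drop the "Article N." prefix and keep only the phrase (assumed preprocessing).
documents = [
    line.split(".", 1)[1].strip()
    for line in lines
    if line.startswith("Article ")
]

# Fit once on the whole corpus so every row shares the same vocabulary.
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)

print(X.shape)                         # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
```

Fitting once means that column j refers to the same word in every row, which is what makes the rows comparable as vectors.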

SheepPerplexed answered Nov 02 '22 17:11



This error occurs when you pass a raw string directly to fit_transform. Instead, wrap the string in a list, Y = [X], and pass Y as the argument; then it works. I ran into this problem too.
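A minimal sketch of the difference, using one of the phrases from the question as sample input (the variable names `raised` and `matrix` are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = "It is not good to eat pizza after midnight"

# Passing the bare string raises the ValueError from the question.
try:
    vectorizer.fit_transform(X)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Wrapping the string in a list makes it an iterable of documents.
Y = [X]
matrix = vectorizer.fit_transform(Y)
print(matrix.shape)  # one row, one column per distinct word
```

Note that calling fit_transform separately for each line relearns the vocabulary every time, so the column indices of one line's vector do not match those of another. If the vectors need to be comparable, it is probably better to gather all the lines into one list and fit once.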

purna15111 answered Nov 02 '22 15:11
