I would like to change the following phrases to vectors with sklearn:
Article 1. It is not good to eat pizza after midnight
Article 2. I wouldn't survive a day withouth stackexchange
Article 3. All of these are just random phrases
Article 4. To prove if my experiment works.
Article 5. The red dog jumps over the lazy fox
I got the following code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
n = 0
while n < 5:
    n = n + 1
    a = ('Article %(number)s' % {'number': n})
    print(a)
    with open("LISR2.txt") as openfile:
        for line in openfile:
            if a in line:
                X = line
                print(vectorizer.fit_transform(X))
Which gives me the following error:
ValueError: Iterable over raw text documents expected, string object received.
Why does this happen? I expected this to work, because if I type the phrases in by hand:
X=("It is not good to eat pizza","I wouldn't survive a day", "All of these")
print(vectorizer.fit_transform(X))
It gives me my desired vectors.
(0, 8) 1
(0, 2) 1
(0, 11) 1
(0, 3) 1
(0, 6) 1
(0, 4) 1
(0, 5) 1
(1, 1) 1
(1, 9) 1
(1, 12) 1
(2, 10) 1
(2, 7) 1
(2, 0) 1
Look at the docs. They say CountVectorizer.fit_transform expects an iterable of strings (e.g. a list of strings). You are passing a single string instead.
This makes sense, because fit_transform in scikit-learn does two things: 1) it learns a model (fit), and 2) it applies that model to the data (transform). You want to build a matrix whose columns are all the words in the vocabulary and whose rows correspond to the documents. To do that, the vectorizer needs to see the whole vocabulary of your corpus (all the columns) at once, so it has to be given every document together rather than one line at a time.
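As a rough sketch of that idea, the matching lines can be collected into a list first and vectorized in a single call. The contents of LISR2.txt are not shown in the question, so the lines are inlined here as an assumption, and the helper name collect_articles is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed contents of LISR2.txt, inlined so the sketch runs on its own.
file_lines = [
    "Article 1. It is not good to eat pizza after midnight",
    "Article 2. I wouldn't survive a day withouth stackexchange",
    "Article 3. All of these are just random phrases",
    "Article 4. To prove if my experiment works.",
    "Article 5. The red dog jumps over the lazy fox",
]


def collect_articles(lines, count=5):
    """Hypothetical helper: gather the lines for 'Article 1' .. 'Article <count>'."""
    documents = []
    for n in range(1, count + 1):
        marker = "Article %d" % n
        for line in lines:
            if marker in line:
                documents.append(line.strip())
    return documents


documents = collect_articles(file_lines)

# One fit_transform call over the whole list: the vocabulary (columns) is
# learned from all documents, and each document becomes one row.
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)

print(X.shape)                       # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
```

With a real file, passing the open file object as lines should behave the same way, since the helper only iterates over it once per marker if the lines are first read into a list.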
This problem occurs when you pass a raw string directly to the vectorizer. Wrap the string in a list instead, e.g. Y = [X], and pass Y as the argument; then it works. I ran into this problem too.
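A minimal sketch of the difference, using one sample line from the question (the variable names raised and matrix are only for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

line = "Article 1. It is not good to eat pizza after midnight"
vectorizer = CountVectorizer(min_df=1)

# Passing the bare string reproduces the error from the question.
try:
    vectorizer.fit_transform(line)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Wrapping the string in a list makes it a one-document corpus.
Y = [line]
matrix = vectorizer.fit_transform(Y)
print(matrix.shape)  # one row, one column per distinct token in this line
```

Note that fitting on a single line at a time learns a separate vocabulary for each line, so the column indices of the resulting vectors are not comparable across lines. If the vectors need to share columns, all lines should go into one list and be passed in a single call.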