
Apply CountVectorizer to column with list of words in rows in Python

I wrote a preprocessing step for text analysis that removes stopwords and applies stemming, like this:

test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

I now have a column in which each row is a list of "cleaned words". Here are 3 rows from that column:

['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']

I now want to apply CountVectorizer to this column:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False)  # keep only the 1500 most frequent terms
X_train = cv.fit_transform(train[col])

But I got an error:

TypeError: expected string or bytes-like object

It would be a bit strange to join each list back into a string, only to have CountVectorizer split it again.
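
For context, the error can be reproduced on a tiny example. With analyzer='word', CountVectorizer runs its token regex over every document, so each document has to be a string and a list makes the regex call fail. The sketch below uses made-up rows (`rows` is a hypothetical stand-in for train[col]) and only captures the exception so it can be inspected:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for train[col]: every document is already a list of tokens
rows = [["size"], ["brand", "new", "coach", "bag"]]

cv = CountVectorizer(analyzer="word", lowercase=False)

try:
    cv.fit_transform(rows)
    error = None
except TypeError as exc:
    # The built-in word analyzer hands each document to a regex, which rejects a list
    error = exc

print(type(error).__name__, error)
```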

Yury Wallet asked Dec 08 '17 09:12


2 Answers

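To apply CountVectorizer to lists of words, bypass the built-in analyzer by passing a callable that returns each list unchanged. With a callable analyzer, CountVectorizer skips its own preprocessing and tokenization and counts the items the callable returns.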

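from sklearn.feature_extraction.text import CountVectorizer

docs = [['ab', 'cd'], ['ab', 'de']]
# The callable receives each row as-is, so the token lists pass through untouched
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
vectorizer.fit_transform(docs).toarray()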

Out:
array([[1, 1, 0],
       [1, 0, 1]], dtype=int64)
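
The same idea should carry over to a DataFrame column like the one in the question. This is only a minimal sketch: the frames, the column name `description`, and the helper `identity` are all made up for illustration. It fits on the training column and reuses the fitted vocabulary on the test column with transform:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the question's train/test frames and column name
col = "description"
train = pd.DataFrame({col: [["size"],
                            ["brand", "new", "coach", "bag", "coach"],
                            ["new", "bubble", "wrap"]]})
test = pd.DataFrame({col: [["new", "coach", "wallet"]]})


def identity(tokens):
    """Return the token list unchanged so CountVectorizer counts it as-is."""
    return tokens


cv = CountVectorizer(analyzer=identity, max_features=1500)
X_train = cv.fit_transform(train[col])  # learn the vocabulary on train only
X_test = cv.transform(test[col])        # reuse it; unseen words are ignored

# Vocabulary terms ordered by their column index in the count matrix
features = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(features)
print(X_train.toarray())
print(X_test.toarray())
```

A named function is used instead of a lambda here so that the fitted vectorizer can be pickled with the standard pickle module; for a throwaway script a lambda works just as well.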

Aleksandr Gavrilov answered Oct 17 '22 04:10


As I found no other way to avoid the error, I joined each list in the column back into a single string:

train[col] = train[col].apply(lambda x: " ".join(x))
test[col] = test[col].apply(lambda x: " ".join(x))

Only after that did I get a result:

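X_train = cv.fit_transform(train[col])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it from 1.0 onward
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out())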
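
One caveat with joining: the default word analyzer re-tokenizes the string with its own pattern, which only keeps tokens of two or more word characters. A one-letter token such as the 'x' in the sample rows would therefore be dropped, and tokens containing an apostrophe would be split. The sketch below uses made-up rows (`rows` is hypothetical) to compare the joined-string vocabulary with the one from a pass-through analyzer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows of already-cleaned tokens, including a one-letter token
rows = [["pcs", "new", "x", "kraft"], ["brand", "new", "coach"]]

# Approach 1: join each list into a string and let the default word analyzer re-tokenize
joined = CountVectorizer(analyzer="word", lowercase=False)
joined.fit([" ".join(tokens) for tokens in rows])

# Approach 2: hand the token lists over unchanged through a callable analyzer
passthrough = CountVectorizer(analyzer=lambda tokens: tokens)
passthrough.fit(rows)

print(sorted(joined.vocabulary_))       # 'x' is missing here
print(sorted(passthrough.vocabulary_))  # 'x' is kept here
```

If the one-letter tokens matter, the pass-through analyzer keeps the vocabulary identical to the cleaned lists.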

Yury Wallet answered Oct 17 '22 02:10