
Python - From list of list of tokens to bag of words

I am struggling to compute a bag of words. I have a pandas DataFrame with a text column that I tokenize, strip of stop words, and stem. In the end, for each document, I have a list of strings.

My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function to do that, but it works on strings, not on lists of strings.

I am doing the preprocessing myself with NLTK and would like to keep it that way...

Is there a way to compute a bag of words from a list of lists of tokens? For example, something like this:

["hello", "world"]
["hello", "stackoverflow", "hello"]

should be converted into

[1, 1, 0]
[2, 0, 1]

with vocabulary:

["hello", "world", "stackoverflow"]

Florian, asked Oct 26 '25 04:10


2 Answers

You can count the tokens of each document with Counter (keeping only words from the vocabulary), build a DataFrame from those counts, and then convert its rows to lists:

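import pandas as pd
from collections import Counter

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

# Vocabulary; its order defines the order of the counts in each vector
L = ["hello", "world", "stackoverflow"]

# Count only the tokens that belong to the vocabulary
f = lambda x: Counter([y for y in x if y in L])
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .reindex(columns=L)   # also adds vocabulary words that never occur
               .fillna(0)
               .astype(int)
               .values
               .tolist())
print(df)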

                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
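
If the vocabulary is not known in advance, it can be derived from the tokens themselves. The following is only a minimal sketch using the standard library; `bag_of_words` is a hypothetical helper name, and ordering the vocabulary by first appearance is an assumption chosen to match the example in the question.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    # dict keys keep insertion order, so this yields each distinct token
    # in the order it first appears across the documents.
    vocab = list(dict.fromkeys(tok for doc in docs for tok in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # a missing token counts as 0
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['hello', 'world', 'stackoverflow']
print(vectors)  # [[1, 1, 0], [2, 0, 1]]
```

With a DataFrame, the same helper could be called as `bag_of_words(df['text'].tolist())`.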

jezrael, answered Oct 27 '25 18:10


sklearn.feature_extraction.text.CountVectorizer can help a lot. Here is the example from the official documentation:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

You can get the feature names with the method vectorizer.get_feature_names_out() (scikit-learn 1.0 and later; the older vectorizer.get_feature_names() was removed in 1.2).
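
The documentation example works on raw strings, while the question starts from lists of tokens. One way to bridge the gap, sketched here rather than given as the definitive approach, is to pass a callable as `analyzer`, which CountVectorizer then uses in place of its own preprocessing and tokenization. The identity lambda and the fixed `vocabulary` list are illustrative choices; without `vocabulary`, the columns would be sorted alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Documents that are already tokenized (e.g. with NLTK)
docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# Fixing the vocabulary also fixes the column order of the result
vocab = ["hello", "world", "stackoverflow"]

# A callable analyzer receives each document unchanged and must return its
# tokens; the identity function keeps the existing tokenization as-is
# (no lowercasing or stop-word removal is applied by scikit-learn).
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, vocabulary=vocab)
X = vectorizer.fit_transform(docs)

bow = X.toarray().tolist()
print(bow)  # [[1, 1, 0], [2, 0, 1]]
```

The result `X` is a sparse matrix, which is usually worth keeping as-is for large corpora instead of calling `toarray()`.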

Zhangjian, answered Oct 27 '25 19:10


