 

How to get bag of words from textual data? [closed]


I am working on a prediction problem using a large textual dataset, and I am implementing the Bag of Words model.

What is the best way to get the bag of words? Right now, I have tf-idf scores for the various words, and the number of words is too large to use for further tasks. If I use a tf-idf criterion, what should the tf-idf threshold be for selecting the bag of words, or should I use some other algorithm? I am using Python.

asked Mar 19 '13 by hshed

People also ask

How do you calculate a bag of words?

Counts: count the number of times each word appears in a document. Frequencies: calculate the frequency with which each word appears in a document, out of all the words in that document.
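
For example, a minimal sketch in Python (the sentence below is just for illustration):

import collections, re

doc = "John likes to watch movies. Mary likes movies too."

# counts: how many times each word appears in the document
tokens = re.findall(r'\w+', doc.lower())
counts = collections.Counter(tokens)

# frequencies: each count divided by the total number of words
total = sum(counts.values())
frequencies = {word: n / total for word, n in counts.items()}

print(counts)       # e.g. 'likes' -> 2, 'movies' -> 2
print(frequencies)  # e.g. 'likes' -> 2/9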

How do you build a bag of words model?

To create the bag of words model, we need to create a matrix where the columns correspond to the most frequent words in our dictionary and the rows correspond to the documents or sentences. Each row then holds the word counts for one document, as sketched below.
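
A minimal sketch of such a matrix, built by hand for two short example sentences (here the vocabulary is simply every word that occurs in them, not only the most frequent ones):

import re

sentences = ["John likes movies", "Mary likes football"]

# vocabulary: one column per distinct token, in a fixed order
vocab = sorted({w for s in sentences for w in re.findall(r'\w+', s.lower())})

# one row per sentence, one column per vocabulary word
matrix = [[re.findall(r'\w+', s.lower()).count(w) for w in vocab] for s in sentences]

print(vocab)   # ['football', 'john', 'likes', 'mary', 'movies']
print(matrix)  # [[0, 1, 1, 0, 1], [1, 0, 1, 1, 0]]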

What type of data does bag-of-words represent?

A bag of words is a representation of text that describes the occurrence of words within a document.

How do I generate TF-IDF for the bag-of-words?

Inverse document frequency refers to the log of the total number of documents divided by the number of documents that contain the word. The logarithm is used to dampen the effect of very high IDF values. TF-IDF is computed by multiplying the term frequency by the inverse document frequency.
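
A minimal sketch using scikit-learn's TfidfVectorizer (the corpus is just for illustration; max_features is one way to cap the vocabulary size when, as in the question, there are too many words):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["John likes to watch movies. Mary likes movies too.",
          "John also likes to watch football games."]

# keep only the 1000 most frequent terms to limit vocabulary size
vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())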


2 Answers

Using the collections.Counter class

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
             'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
                   for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>> 
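If, as in the question, the full vocabulary is too large, one option (just a sketch, assuming you simply want the k most frequent words) is to keep only the top entries of the combined counter:

>>> top_words = [word for word, count in sumbags.most_common(5)]
>>> top_words   # order among equal counts may vary
['likes', 'John', 'to', 'watch', 'movies']
>>> [[bag[word] for word in top_words] for bag in bagsofwords]
[[2, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
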
answered Oct 09 '22 by Paddy3118


A bag of words can be defined as a matrix where each row represents a document and each column represents an individual token. Note that the sequential order of the text is not maintained. Building a "Bag of Words" involves 3 steps:

  1. tokenizing
  2. counting
  3. normalizing

Limitations to keep in mind:

  1. It cannot capture phrases or multi-word expressions (see the n-gram sketch after the code below).
  2. It is sensitive to misspellings; you can work around that with a spell corrector or a character-level representation.

e.g.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]

# each row of X is a document, each column the count of one token
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
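
To address the first limitation above, CountVectorizer can also count word n-grams; this is a sketch continuing the snippet above, not part of the original answer:

# ngram_range=(1, 2) counts single words and adjacent word pairs ("phrases")
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X2 = bigram_vectorizer.fit_transform(data_corpus)
print(bigram_vectorizer.get_feature_names_out())  # includes e.g. 'watch movies'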
answered Oct 09 '22 by Pramit