 

How to get bag of words from textual data? [closed]


I am working on a prediction problem using a large textual dataset, and I am implementing the Bag of Words model.

What is the best way to get the bag of words? Right now, I have tf-idf scores for the various words, and the number of words is too large to use for further tasks. If I use a tf-idf criterion, what should the tf-idf threshold be for selecting the bag of words, or should I use some other algorithm? I am using Python.

asked Mar 19 '13 by hshed

People also ask

How do you calculate a bag of words?

Counts: count the number of times each word appears in a document. Frequencies: calculate the frequency with which each word appears in a document, out of all the words in that document.
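
For example, a minimal sketch in Python (the sentence below is just for illustration):

import collections, re

doc = "John likes to watch movies. Mary likes movies too."

# counts: how many times each word appears in the document
tokens = re.findall(r'\w+', doc.lower())
counts = collections.Counter(tokens)

# frequencies: each count divided by the total number of words
total = sum(counts.values())
frequencies = {word: n / total for word, n in counts.items()}

print(counts)       # e.g. 'likes' -> 2, 'movies' -> 2
print(frequencies)  # e.g. 'likes' -> 2/9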

How do you build a bag of words model?

To create the bag of words model, we need to create a matrix where the columns correspond to the most frequent words in our dictionary and the rows correspond to the documents or sentences. Each row then holds the word counts for one document, as sketched below.
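
A minimal sketch of such a matrix, built by hand for two short example sentences (here the vocabulary is simply every word that occurs in them, not only the most frequent ones):

import re

sentences = ["John likes movies", "Mary likes football"]

# vocabulary: one column per distinct token, in a fixed order
vocab = sorted({w for s in sentences for w in re.findall(r'\w+', s.lower())})

# one row per sentence, one column per vocabulary word
matrix = [[re.findall(r'\w+', s.lower()).count(w) for w in vocab] for s in sentences]

print(vocab)   # ['football', 'john', 'likes', 'mary', 'movies']
print(matrix)  # [[0, 1, 1, 0, 1], [1, 0, 1, 1, 0]]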

What type of data does bag-of-words represent?

A bag of words is a representation of text that describes the occurrence of words within a document.

How do I generate TF-IDF for the bag-of-words?

Inverse document frequency refers to the log of the total number of documents divided by the number of documents that contain the word. The logarithm is used to dampen the effect of very high IDF values. TF-IDF is computed by multiplying the term frequency by the inverse document frequency.
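
A minimal sketch using scikit-learn's TfidfVectorizer (the corpus is just for illustration; max_features is one way to cap the vocabulary size when, as in the question, there are too many words):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["John likes to watch movies. Mary likes movies too.",
          "John also likes to watch football games."]

# keep only the 1000 most frequent terms to limit vocabulary size
vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())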


2 Answers

Using the collections.Counter class

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
             'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
                   for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>> 
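If, as in the question, the full vocabulary is too large, one option (just a sketch, assuming you simply want the k most frequent words) is to keep only the top entries of the combined counter:

>>> top_words = [word for word, count in sumbags.most_common(5)]
>>> top_words   # order among equal counts may vary
['likes', 'John', 'to', 'watch', 'movies']
>>> [[bag[word] for word in top_words] for bag in bagsofwords]
[[2, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
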
answered Oct 09 '22 by Paddy3118


A bag of words can be defined as a matrix where each row represents a document and each column represents an individual token. Note that the sequential order of the text is not maintained. Building a "Bag of Words" involves 3 steps:

  1. tokenizing
  2. counting
  3. normalizing

Limitations to keep in mind:

  1. It cannot capture phrases or multi-word expressions (see the n-gram sketch after the code below).
  2. It is sensitive to misspellings; you can work around that with a spell corrector or a character-level representation.

e.g.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]

# each row of X is a document, each column the count of one token
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
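
To address the first limitation above, CountVectorizer can also count word n-grams; this is a sketch continuing the snippet above, not part of the original answer:

# ngram_range=(1, 2) counts single words and adjacent word pairs ("phrases")
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X2 = bigram_vectorizer.fit_transform(data_corpus)
print(bigram_vectorizer.get_feature_names_out())  # includes e.g. 'watch movies'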
answered Oct 09 '22 by Pramit