I am using Python's sklearn library. I have 150,000+ sentences.
I need an array-like object where each row corresponds to a sentence, each column corresponds to a word, and each element is the number of times that word appears in that sentence.
For example: If the two sentences were "The dog ran" and "The boy ran", I need
[[1, 1, 1, 0],
 [0, 1, 1, 1]]
(the order of the columns is irrelevant, and depends on which column is assigned to which word)
My array will be sparse (each sentence contains only a small fraction of the possible words), so I am using scipy.sparse.
def word_counts(texts, word_map):
    w_counts = sp.???_matrix((len(texts), len(word_map)))
    for n in range(len(texts)):
        for word in re.findall(r"[\w']+", texts[n]):
            index = word_map.get(word)
            if index is not None:
                w_counts[n, index] += 1
    return w_counts
...
nb = MultinomialNB() #from sklearn
words = features.word_list(texts)
nb.fit(features.word_counts(texts,words), classes)
I want to know which sparse matrix format would be best here.
I tried using coo_matrix but got an error:
TypeError: 'coo_matrix' object has no attribute '__getitem__'
I looked at the documentation for COO but was very confused by the following:
Sparse matrices can be used in arithmetic operations ...
Disadvantages of the COO format ... does not directly support: arithmetic operations
I used dok_matrix, and that worked, but I don't know if this performs best in this case.
Thanks in advance.
Try either lil_matrix or dok_matrix; those are easy to construct and inspect (but in the case of lil_matrix, potentially very slow as each insertion takes linear time). Scikit-learn estimators that accept sparse matrices will accept any format and convert them to an efficient format internally (usually csr_matrix). You can also do the conversion yourself using the methods tocoo, todok, tocsr etc. on scipy.sparse matrices.
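For example, here is a rough sketch of the question's word_counts built on dok_matrix and converted to CSR at the end (word_map is assumed to be the question's dict mapping each word to a column index; the final tocsr() call is optional, since the estimator would do the conversion anyway):

import re
import scipy.sparse as sp

def word_counts(texts, word_map):
    # dok_matrix supports cheap incremental updates of individual entries
    w_counts = sp.dok_matrix((len(texts), len(word_map)), dtype=int)
    for n, text in enumerate(texts):
        for word in re.findall(r"[\w']+", text):
            index = word_map.get(word)
            if index is not None:
                w_counts[n, index] += 1
    # convert to compressed sparse row format, which estimators use internally
    return w_counts.tocsr()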
Or, just use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes entire documents as input:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> documents = ["The dog ran", "The boy ran"]
>>> vectorizer = CountVectorizer(min_df=0)
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[0, 1, 1, 1],
       [1, 0, 1, 1]])
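The resulting matrix X can be fed straight to the classifier from the question. A rough sketch, assuming classes holds one made-up label per document:

from sklearn.naive_bayes import MultinomialNB

classes = ["animal", "person"]   # hypothetical labels, one per document
nb = MultinomialNB()
nb.fit(X, classes)               # X is the sparse output of fit_transform

# new sentences must go through the same fitted vectorizer
print(nb.predict(vectorizer.transform(["The dog ran away"])))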
... while DictVectorizer assumes you've already done tokenization and counting, with the result of that in a dict per sample:
>>> from sklearn.feature_extraction import DictVectorizer
>>> documents = [{"the":1, "boy":1, "ran":1}, {"the":1, "dog":1, "ran":1}]
>>> vectorizer = DictVectorizer()
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[ 1., 0., 1., 1.],
       [ 0., 1., 1., 1.]])
>>> vectorizer.inverse_transform(X[0])
[{'ran': 1.0, 'boy': 1.0, 'the': 1.0}]
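If you are starting from raw sentences, one way to build those per-sample dicts is with collections.Counter and the token regex from the question; a rough sketch (just one possible tokenization):

import re
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

texts = ["The dog ran", "The boy ran"]
# Counter is a dict subclass, so DictVectorizer accepts it directly
counts = [Counter(re.findall(r"[\w']+", t.lower())) for t in texts]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(counts)   # sparse matrix, one row per sentence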
(The min_df argument to CountVectorizer was added a few releases ago. If you're using an old version, omit it, or rather, upgrade.)
EDIT: According to the FAQ, I must disclose my affiliation, so here goes: I'm the author of DictVectorizer, and I also wrote parts of CountVectorizer.