
Counting with scipy.sparse

I am using the Python sklearn libraries. I have 150,000+ sentences.

I need an array-like object where each row corresponds to a sentence, each column corresponds to a word, and each element is the number of times that word occurs in that sentence.

For example: If the two sentences were "The dog ran" and "The boy ran", I need

[ [1, 1, 1, 0]
, [0, 1, 1, 1] ]

(the order of the columns is irrelevant, and depends on which column is assigned to which word)

My array will be sparse (each sentence contains only a small fraction of the possible words), so I am using scipy.sparse.

import re
import scipy.sparse as sp

def word_counts(texts, word_map):
    w_counts = sp.???_matrix((len(texts),len(word_map)))

    # iterate over every text, including the last one
    for n in range(len(texts)):
        for word in re.findall(r"[\w']+", texts[n]):
            index = word_map.get(word)  # None for words not in the map
            if index is not None:
                w_counts[n,index] += 1
    return w_counts

...
nb = MultinomialNB() #from sklearn
words = features.word_list(texts)
nb.fit(features.word_counts(texts,words), classes)

I want to know what sparse matrix would be best.

I tried using coo_matrix but got an error:

TypeError: 'coo_matrix' object has no attribute '__getitem__'

I looked at the documentation for COO but was very confused by the following:

Sparse matrices can be used in arithmetic operations ...
Disadvantages of the COO format ... does not directly support: arithmetic operations

I used dok_matrix, and that worked, but I don't know if this performs best in this case.

Thanks in advance.

Paul Draper asked Nov 08 '12 17:11




1 Answer

Try either lil_matrix or dok_matrix; those are easy to construct and inspect (but in the case of lil_matrix, potentially very slow as each insertion takes linear time). Scikit-learn estimators that accept sparse matrices will accept any format and convert them to an efficient format internally (usually csr_matrix). You can also do the conversion yourself using the methods tocoo, todok, tocsr etc. on scipy.sparse matrices.
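As a minimal sketch of the approach described above, the function from the question can build its counts in a dok_matrix and convert to CSR once at the end. The word_map contents and the integer dtype here are assumptions chosen for illustration, not part of the original question.

```python
import re

import scipy.sparse as sp


def word_counts(texts, word_map):
    """Count known words per text in a DOK matrix, then convert to CSR."""
    # DOK supports cheap element-wise reads and writes while building.
    counts = sp.dok_matrix((len(texts), len(word_map)), dtype=int)
    for row, text in enumerate(texts):
        for word in re.findall(r"[\w']+", text):
            col = word_map.get(word)  # None for words outside the map
            if col is not None:
                counts[row, col] += 1
    # Convert once, after all insertions are done.
    return counts.tocsr()


# Hypothetical word-to-column mapping for the two example sentences.
word_map = {"The": 0, "dog": 1, "ran": 2, "boy": 3}
X = word_counts(["The dog ran", "The boy ran"], word_map)
print(X.toarray())
```

Doing the conversion explicitly is optional, since the estimator would convert anyway, but it keeps the slow element-wise format out of everything downstream.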

Or, just use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes entire documents as input:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> documents = ["The dog ran", "The boy ran"]
>>> vectorizer = CountVectorizer(min_df=0, stop_words=[])
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[0, 1, 1, 1],
       [1, 0, 1, 1]])

... while DictVectorizer assumes you've already done tokenization and counting, with the result of that in a dict per sample:

>>> from sklearn.feature_extraction import DictVectorizer
>>> documents = [{"the":1, "boy":1, "ran":1}, {"the":1, "dog":1, "ran":1}]
>>> vectorizer = DictVectorizer()
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[ 1.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  1.]])
>>> vectorizer.inverse_transform(X[0])
[{'ran': 1.0, 'boy': 1.0, 'the': 1.0}]

(The min_df argument to CountVectorizer was added a few releases ago. If you're using an old version, omit it, or rather, upgrade.)
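To connect this back to the classifier in the question, here is a rough sketch of passing CountVectorizer output straight to MultinomialNB. The texts and class labels are made-up toy data, and min_df is left at its default; the vocabulary_ attribute shows which column was assigned to which word.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data; the class labels are invented for illustration only.
texts = ["The dog ran", "The boy ran", "The dog barked", "The boy sang"]
classes = ["animal", "person", "animal", "person"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix of word counts

nb = MultinomialNB()
nb.fit(X, classes)  # the sparse matrix is accepted as-is

# vocabulary_ maps each (lowercased) word to its column index.
column_of = vectorizer.vocabulary_

# New text must go through the same fitted vectorizer before predicting.
prediction = nb.predict(vectorizer.transform(["A dog ran"]))
print(sorted(column_of), prediction)
```

Note that the default tokenizer lowercases the text and ignores single-character tokens, so the columns may differ from those produced by a hand-written regular expression.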

EDIT According to the FAQ, I must disclose my affiliation, so here goes: I'm the author of DictVectorizer and I also wrote parts of CountVectorizer.

Fred Foo answered Sep 23 '22 04:09