TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] while using RF classifier?

Tags: python, machine-learning, numpy, nlp, scikit-learn

I am learning about random forests in scikit-learn, and as an example I would like to use a random forest classifier for text classification on my own dataset. I first vectorized the text with tf-idf, and then for classification I used:

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10) 
classifier.fit(X_train, y_train)           
prediction = classifier.predict(X_test)

When I ran the classification, I got this:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I called .toarray() on X_train and got the following:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

From a previous question, I understood that I need to reduce the dimensionality of the numpy array, so I did the same:

from sklearn.decomposition import TruncatedSVD
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing)

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

Then I tried the following:

prediction = classifier.predict(X_train.getnnz())

And got this:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()

This raises two questions: how can I use random forests to classify correctly, and what is happening with X_train?

Then I tried the following:

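import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report
# In scikit-learn versions before 0.18 this lived in sklearn.cross_validation.
from sklearn.model_selection import train_test_split

df = pd.read_csv('/path/file.csv',
                 header=0, sep=',', names=['id', 'text', 'label'])

# The vectorizer settings were not shown; defaults are assumed here.
tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

# TruncatedSVD accepts the sparse tf-idf matrix and returns a dense array.
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

# Score on the test split; features and labels must come from the same split.
print('\nscore:', classifier.score(a_test, b_test))
# With more than two classes, precision_score and recall_score need an average= argument.
print('\nprecision:', precision_score(b_test, prediction))
print('\nrecall:', recall_score(b_test, prediction))
print('\nconfusion matrix:\n', confusion_matrix(b_test, prediction))
print('\nclassification report:\n', classification_report(b_test, prediction))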


asked Feb 04 '15 05:02

tumbleweed

2 Answers

I don't know much about sklearn, though I vaguely recall an earlier issue triggered by a switch to sparse matrices. Internally, some of the matrices had to be replaced by m.toarray() or m.todense().

But to give you an idea of what the error message is about, consider:

In [907]: A=np.array([[0,1],[3,4]])
In [908]: M=sparse.coo_matrix(A)
In [909]: len(A)
Out[909]: 2
In [910]: len(M)
...
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

In [911]: A.shape[0]
Out[911]: 2
In [912]: M.shape[0]
Out[912]: 2

len() is usually used in Python to count the first-level items of a list. Applied to a 2d array, it returns the number of rows, but A.shape[0] is a better way of counting rows, and M.shape[0] gives the same result. In this case you aren't interested in .getnnz(), which is the number of nonzero terms of a sparse matrix. A doesn't have this method, though the count can be derived from A.nonzero().
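To make the distinction concrete, here is a minimal sketch. The helper name n_rows is hypothetical (it is not part of NumPy or SciPy); it simply counts rows the same way for dense arrays and sparse matrices, and the last two lines contrast that with the nonzero count.

```python
import numpy as np
from scipy import sparse


def n_rows(X):
    """Return the number of rows of a dense array or a sparse matrix."""
    # shape[0] is defined for both ndarray and scipy.sparse matrices,
    # whereas len() raises TypeError on a sparse matrix.
    return X.shape[0]


A = np.array([[0, 1], [3, 4]])
M = sparse.coo_matrix(A)

rows_dense = n_rows(A)
rows_sparse = n_rows(M)
nnz_sparse = M.getnnz()          # number of stored nonzero entries, not rows
nnz_dense = np.count_nonzero(A)  # dense counterpart of getnnz()

print(rows_dense, rows_sparse, nnz_sparse, nnz_dense)
```

Code that has to accept either kind of input can use shape[0] throughout instead of len().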

answered Sep 21 '22 05:09

hpaulj

It is a bit unclear whether you are passing the same data structure (type and shape) to the classifier's fit and predict methods. Random forests take a long time to run with a large number of features, hence the suggestion to reduce the dimensionality in the post you link to.

You should apply the SVD to both the training and test data, so that the classifier is trained on input of the same shape as the data you want to predict for. Check that the inputs to fit and to predict have the same number of features and are both arrays rather than sparse matrices.
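The advice above could be sketched roughly as follows. This is only an illustration under assumptions: the toy texts, labels, and variable names are made up, and the import paths are those of recent scikit-learn releases. The vectorizer and the SVD are fitted on the training text and then reused to transform the test text, so both arrays are dense and have the same number of columns.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
train_text = ['cat on the', 'angel eyes has', 'blue red angel',
              'one two blue', 'blue whales eat', 'hot tin roof']
train_labels = [0, 0, 0, 1, 1, 1]
test_text = ['angel eyes has', 'have a cat']

vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)

# Fit on the training text, then apply the same fitted transforms to the test text.
X_train = svd.fit_transform(vectorizer.fit_transform(train_text))
X_test = svd.transform(vectorizer.transform(test_text))

# Sanity checks: dense arrays with matching feature counts.
assert isinstance(X_train, np.ndarray) and isinstance(X_test, np.ndarray)
assert X_train.shape[1] == X_test.shape[1]

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train, train_labels)
prediction = classifier.predict(X_test)
print(prediction)
```

Calling transform (not fit_transform) on the test text is what keeps the test features in the same space as the training features.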

Updated with an example, which now uses a DataFrame:

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# In scikit-learn versions before 0.18 this lived in sklearn.cross_validation.
from sklearn.model_selection import train_test_split

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({
    'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
             'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
    'class': [0, 0, 0, 1, 1, 1, 0, 3],
})

X = tfidf_vect.fit_transform(df['text'].values)  # sparse tf-idf matrix
y = df['class'].values

# TruncatedSVD accepts sparse input and returns a dense array.
pca = TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X)

# Split the reduced (dense) array rather than the sparse X.
a_train, a_test, b_train, b_test = train_test_split(X_reduced, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

Note that the SVD is applied before the split into training and test sets, so the array passed to predict has the same number of features as the array passed to fit.
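A different way to get the same guarantee, not what the example above does, is to wrap the steps in a scikit-learn Pipeline, so that whatever transforms were fitted during fit are applied again inside predict. The sketch below assumes a recent scikit-learn release and reuses the toy texts from this answer; the names text, labels, and model are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data taken from the example above.
text = ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
        'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']
labels = [0, 0, 0, 1, 1, 1, 0, 3]

# Split the raw text; the pipeline handles vectorizing and reduction.
text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.33, random_state=42)

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

model.fit(text_train, y_train)         # fits every step on the training text only
prediction = model.predict(text_test)  # reuses the fitted vectorizer and SVD
print(prediction)
```

With this arrangement the classifier never sees a sparse matrix, and the test data cannot end up with a different number of features than the training data.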

answered Sep 24 '22 05:09

JAB
