 

Is it possible to apply PCA to text classification?

I'm working on text classification in Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (I retrieve data from the web as text, then classify this text: web page classification).

Now I'm trying to apply PCA to this data, but Python raises some errors.

My code for classification with Naive Bayes:

from sklearn.decomposition import PCA
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)

x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)

This Naive Bayes classification gives the following output:

>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
    with 6302 stored elements in Compressed Sparse Row format>

>>> print(x_train)
(0, 2966)   1
(0, 1974)   1
(0, 3296)   1
..
..
(42, 1629)  1
(42, 2833)  1
(42, 876)   1

Then I try to apply PCA to my data (temizdata):

>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)

but this raises the following error:

TypeError: A sparse matrix was passed, but dense data is required.
Use X.toarray() to convert to a dense numpy array.

So I converted the matrix to a dense matrix (a numpy array). Then I tried to classify with the new dense matrix, but I got another error.

My main aim is to test the effect of PCA on text classification.

Converting to a dense array:

v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, trying to classify:

classifer.fit(pca_t,y_train)

The error for this final classification:

ValueError: Input X must be non-negative

In short: on one side I feed my data (temizdata) directly into Naive Bayes; on the other side I want to pass temizdata through PCA first (to reduce the number of inputs) and then classify.

asked Jan 11 '16 by zer03



2 Answers

Rather than converting the sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)

And, citing from the TruncatedSVD documentation:

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

which is exactly your use case.
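A minimal end-to-end sketch of this approach (the documents, labels, and classifier below are illustrative stand-ins for the question's `temizdata` / `y_train`; `LogisticRegression` replaces `MultinomialNB` because the SVD output can contain negative values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the question's temizdata / y_train.
temizdata = [
    "python code example tutorial",
    "learn python programming tutorial",
    "football match score goal",
    "team wins the football cup",
]
y_train = ["tech", "tech", "sport", "sport"]

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(temizdata)   # sparse term-count matrix

# TruncatedSVD consumes the sparse matrix directly; no .todense() needed.
svd = TruncatedSVD(n_components=2, random_state=42)
x_reduced = svd.fit_transform(x_counts)          # dense, may contain negatives

# MultinomialNB would reject the negative values, so use a classifier
# that accepts real-valued features.
clf = LogisticRegression()
clf.fit(x_reduced, y_train)

new_doc = vectorizer.transform(["python tutorial"])
print(clf.predict(svd.transform(new_doc)))
```

New documents go through the same `vectorizer.transform` / `svd.transform` pair (not `fit_transform`) so they land in the same reduced space the classifier was trained on.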

answered Oct 29 '22 by Imanol Luengo


The MultinomialNB classifier needs non-negative (count-like) features, but PCA breaks this property: the projected features can be negative, which is exactly what the ValueError is complaining about. You will have to use a different classifier if you want to use PCA.

There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
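For instance, chi-squared feature selection keeps a subset of the original count columns, so the features stay non-negative and MultinomialNB remains usable. A minimal sketch (the documents and labels are illustrative stand-ins for the question's `temizdata` / `y_train`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's temizdata / y_train.
temizdata = [
    "python code example tutorial",
    "learn python programming tutorial",
    "football match score goal",
    "team wins the football cup",
]
y_train = ["tech", "tech", "sport", "sport"]

x_counts = CountVectorizer().fit_transform(temizdata)

# chi2 scores each term against the labels; keep only the k best columns.
# The selected features are still raw counts, so MultinomialNB still applies.
selector = SelectKBest(chi2, k=3)
x_selected = selector.fit_transform(x_counts, y_train)

classifer = MultinomialNB(alpha=0.01)
classifer.fit(x_selected, y_train)
```

Unlike PCA, this reduces dimensionality without transforming the feature values themselves.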

Side note: you could try to discretize the features after applying the PCA, but I don't think this is a good idea.

answered Oct 29 '22 by MB-F