 

Is it possible to apply PCA to text classification?

I'm working on text classification in Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (I retrieve data from the web as text, then classify this text: web page classification).

Now I'm trying to apply PCA to this data, but Python raises some errors.

My code for classification with Naive Bayes:

from sklearn.decomposition import PCA
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)

x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)

This Naive Bayes classification gives the following output:

>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
    with 6302 stored elements in Compressed Sparse Row format>

>>> print(x_train)
(0, 2966)   1
(0, 1974)   1
(0, 3296)   1
..
..
(42, 1629)  1
(42, 2833)  1
(42, 876)   1

Then I try to apply PCA to my data (temizdata):

>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)

but this raises the following error:

TypeError: A sparse matrix was passed, but dense data is required.
Use X.toarray() to convert to a dense numpy array.

So I converted the matrix to a dense matrix (a numpy array). Then I tried to classify with the new dense matrix, but I got another error.

My main aim is to test the effect of PCA on text classification.

Converting to a dense array:

v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, trying to classify:

classifer.fit(pca_t,y_train)

The error for this final classification:

ValueError: Input X must be non-negative

In short: on one side I feed my data (temizdata) directly into Naive Bayes; on the other side I want to pass temizdata through PCA first (to reduce the number of inputs) and then classify.

asked Jan 11 '16 by zer03



2 Answers

Rather than converting the sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)

And, citing from the TruncatedSVD documentation:

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

which is exactly your use case.
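A minimal end-to-end sketch of this approach (the documents, labels, and classifier below are illustrative stand-ins for the question's `temizdata` / `y_train`; `LogisticRegression` replaces `MultinomialNB` because the SVD output can contain negative values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the question's temizdata / y_train.
temizdata = [
    "python code example tutorial",
    "learn python programming tutorial",
    "football match score goal",
    "team wins the football cup",
]
y_train = ["tech", "tech", "sport", "sport"]

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(temizdata)   # sparse term-count matrix

# TruncatedSVD consumes the sparse matrix directly; no .todense() needed.
svd = TruncatedSVD(n_components=2, random_state=42)
x_reduced = svd.fit_transform(x_counts)          # dense, may contain negatives

# MultinomialNB would reject the negative values, so use a classifier
# that accepts real-valued features.
clf = LogisticRegression()
clf.fit(x_reduced, y_train)

new_doc = vectorizer.transform(["python tutorial"])
print(clf.predict(svd.transform(new_doc)))
```

New documents go through the same `vectorizer.transform` / `svd.transform` pair (not `fit_transform`) so they land in the same reduced space the classifier was trained on.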

answered Oct 29 '22 by Imanol Luengo


The MultinomialNB classifier needs non-negative (count-like) features, but PCA breaks this property: the projected features can be negative, which is exactly what the ValueError is complaining about. You will have to use a different classifier if you want to use PCA.

There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
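For instance, chi-squared feature selection keeps a subset of the original count columns, so the features stay non-negative and MultinomialNB remains usable. A minimal sketch (the documents and labels are illustrative stand-ins for the question's `temizdata` / `y_train`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's temizdata / y_train.
temizdata = [
    "python code example tutorial",
    "learn python programming tutorial",
    "football match score goal",
    "team wins the football cup",
]
y_train = ["tech", "tech", "sport", "sport"]

x_counts = CountVectorizer().fit_transform(temizdata)

# chi2 scores each term against the labels; keep only the k best columns.
# The selected features are still raw counts, so MultinomialNB still applies.
selector = SelectKBest(chi2, k=3)
x_selected = selector.fit_transform(x_counts, y_train)

classifer = MultinomialNB(alpha=0.01)
classifer.fit(x_selected, y_train)
```

Unlike PCA, this reduces dimensionality without transforming the feature values themselves.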

Side note: you could try to discretize the features after applying the PCA, but I don't think this is a good idea.

answered Oct 29 '22 by MB-F