I'm trying text classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (I retrieve data from the web, convert it to text, and then classify that text: web classification).
Now I'm trying to apply PCA to this data, but Python raises some errors.
My code for classification with Naive Bayes:
from sklearn.decomposition import PCA
from sklearn.decomposition import RandomizedPCA  # imported but not used below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the raw documents into a sparse matrix of term counts
vectorizer = CountVectorizer()
classifier = MultinomialNB(alpha=.01)

x_train = vectorizer.fit_transform(temizdata)
classifier.fit(x_train, y_train)
This Naive Bayes classification gives the following output:
>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
with 6302 stored elements in Compressed Sparse Row format>
>>> print(x_train)
(0, 2966) 1
(0, 1974) 1
(0, 3296) 1
..
..
(42, 1629) 1
(42, 2833) 1
(42, 876) 1
Then I try to apply PCA to my data (temizdata):
>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA.fit_transform(v_temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)
but this raises the following error:
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I converted the matrix to a dense matrix (a numpy array) and then tried to classify that dense matrix, but I got another error.
My main aim is to test the effect of PCA on text classification.
Converting to a dense array:
# Materialize the sparse count matrix as a dense matrix for PCA
v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)
Finally, trying to classify:
classifier.fit(pca_t, y_train)
The error from this final classification:
raise ValueError("Input X must be non-negative")
ValueError: Input X must be non-negative
On one side, my data (temizdata) goes into Naive Bayes directly; on the other side, temizdata is first passed through PCA (to reduce the number of inputs) and then classified.
__
Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, which is a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:
from sklearn.decomposition import TruncatedSVD

# Works directly on the sparse matrix; no .toarray() needed
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)
And, citing from the TruncatedSVD documentation:
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
which is exactly your use case.
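Put together, a minimal sketch of this approach for your case might look as follows (temizdata and y_train are your own corpus and labels from the question; n_components=100 and LogisticRegression are arbitrary choices here, picked because logistic regression accepts the real-valued features that SVD produces):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Vectorize the raw documents into a sparse term-count matrix
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(temizdata)

# Reduce dimensionality directly on the sparse matrix
svd = TruncatedSVD(n_components=100, random_state=42)
x_reduced = svd.fit_transform(x_train)

# SVD components can be negative, so use a classifier that accepts
# arbitrary real-valued features instead of MultinomialNB
classifier = LogisticRegression()
classifier.fit(x_reduced, y_train)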
The MultinomialNB classifier expects non-negative, count-like (essentially discrete) features, but PCA breaks this property: its components can take negative values. You will have to use a different classifier if you want to use PCA.
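For example (a sketch only, reusing v_temizdata and y_train from the question; n_components=10 is an arbitrary choice), GaussianNB models real-valued features and has no non-negativity requirement, so it can consume PCA output directly:
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# PCA needs dense input, so convert the sparse count matrix first
pca_t = PCA(n_components=10).fit_transform(v_temizdata.toarray())

# GaussianNB fits a Gaussian per feature and class, so the negative
# values produced by PCA are not a problem
classifier = GaussianNB()
classifier.fit(pca_t, y_train)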
There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
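As a sketch of that idea (reusing x_train and y_train from the question; k=500 is an arbitrary choice), chi-squared feature selection keeps the selected features as raw non-negative counts, so the result still works with MultinomialNB:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Keep only the k terms most associated with the class labels;
# the selected columns are still non-negative counts
selector = SelectKBest(chi2, k=500)
x_selected = selector.fit_transform(x_train, y_train)

classifier = MultinomialNB(alpha=.01)
classifier.fit(x_selected, y_train)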
Side note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.