PCA with missing values in Python

Tags: python, numpy, pca

I'm trying to do a PCA on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?

Thanks.

Emily asked Apr 02 '15 19:04


2 Answers

Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a probabilistic PCA (PPCA) algorithm, which recovers the same principal subspace as PCA on complete data, but in some implementations can deal with missing data more robustly.
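For reference, the PPCA model of Tipping and Bishop can be written as below. The notation is my own choice for this sketch: d observed dimensions, q latent dimensions, and o for the set of coordinates that are actually observed in a given row.

```latex
\begin{aligned}
x &= W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I_q), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d) \\
x &\sim \mathcal{N}\!\left(\mu,\; W W^\top + \sigma^2 I_d\right) \\
x_o &\sim \mathcal{N}\!\left(\mu_o,\; W_o W_o^\top + \sigma^2 I_{|o|}\right)
\end{aligned}
```

The last line is what makes missing values manageable: the marginal over the observed coordinates is still Gaussian (W_o is just the rows of W for those coordinates), so the likelihood can be evaluated on whatever entries each row has, without filling in the rest first.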

I have found two libraries:

  1. PPCA on PyPI, which is called PCA-magic on GitHub
  2. PyPPCA, which has the same name on PyPI and GitHub

Since both packages are poorly maintained, you might want to implement it yourself instead. The packages above build on the theory presented in the well-cited (and well-written!) paper by Tipping and Bishop (1999). It is available on Tipping's home page if you want guidance on how to implement PPCA properly.
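If you do go the do-it-yourself route, here is a rough sketch of the general idea in plain NumPy. To be clear, this is not the full PPCA EM algorithm from the paper and not what either package does. It is the simpler iterative PCA imputation: fit a rank-k SVD, overwrite only the missing cells with the reconstruction, and repeat. The function name `iterative_pca_impute` and its parameters are made up for this example, and it assumes missing values are encoded as NaN and every column has at least one observed value.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, n_iter=100, tol=1e-6):
    """Fill NaNs in X by alternating a rank-k SVD fit with re-imputation.

    Hypothetical helper for illustration; not the API of PPCA or PyPPCA.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means computed over the observed entries only.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        # Rank-k reconstruction of the centred, currently filled matrix.
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        # Only the missing cells are overwritten; observed data stay fixed.
        new_filled = np.where(missing, recon, X)
        change = np.linalg.norm(new_filled - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = new_filled
        if change < tol:
            break

    # Final PCA on the completed matrix.
    mu = filled.mean(axis=0)
    U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
    components = Vt[:n_components]
    scores = (filled - mu) @ components.T
    return filled, components, scores
```

With a masked array `m`, something like `m.astype(float).filled(np.nan)` should give a suitable input. Unlike proper PPCA, this sketch has no noise model, so treat it as a starting point rather than a replacement for the paper's algorithm.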

As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but it is not implemented in a way that handles missing values.
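A quick way to see both halves of that statement, as a small sketch on made-up random data: fitting on an array containing NaN raises a `ValueError`, while on complete data the probabilistic side shows up through the `noise_variance_` attribute and the `score` method (average log-likelihood of the samples).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # made-up data for illustration
X[3, 1] = np.nan               # a single missing cell is enough

try:
    PCA(n_components=2).fit(X)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN before fitting.
    fit_failed = True
    print("PCA refused the data:", err)

# Dropping incomplete rows lets the fit go through.
complete_rows = ~np.isnan(X).any(axis=1)
X_complete = X[complete_rows]
pca = PCA(n_components=2).fit(X_complete)
print("noise variance:", pca.noise_variance_)
print("mean log-likelihood:", pca.score(X_complete))
```

Dropping rows is only used here to get a fit at all; it throws away the observed values in those rows, which is exactly what a missing-data-aware PPCA avoids.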

EDIT: Both libraries above had issues, so I could not use them directly myself. I forked PyPPCA and fixed the bugs. The fork is available on GitHub.

LudvigH answered Sep 19 '22 06:09


I think you will probably need to preprocess the data before doing PCA. You can use:

sklearn.impute.SimpleImputer

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer

With this class you can automatically replace the missing values with the mean, median or most frequent value. Which of these options is best is hard to say; it depends on many factors, such as what the data look like.
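As a small illustration of the three strategies (the toy array is made up, with NaN marking the missing cells):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [2.0, 50.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Each column is filled using statistics of its own observed values.
    results[strategy] = imputer.fit_transform(X)
    print(strategy)
    print(results[strategy])
```

Note that every missing cell in a column gets the same value, which shrinks that column's variance. That matters for PCA, since PCA is driven by variances and covariances.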

By the way, you can also do PCA with the same library, using:

sklearn.decomposition.PCA

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

The library also offers many other statistical functions and machine learning techniques.
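Putting the two pieces together, a minimal sketch might look like the following. The data and the missingness pattern are randomly generated for illustration, and mean imputation is just one possible choice of strategy. Since the question starts from a masked array, the masked cells are converted to NaN first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
mask = rng.random(data.shape) < 0.1   # made-up pattern, roughly 10% of cells

# scikit-learn does not understand masks, so turn masked cells into NaN.
masked = np.ma.masked_array(data, mask=mask)
X = masked.filled(np.nan)

# Impute first, then project onto the first two principal components.
model = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=2))
scores = model.fit_transform(X)

print(scores.shape)
print(model.named_steps["pca"].explained_variance_ratio_)
```

Wrapping both steps in one pipeline means the imputation statistics learned during `fit` are reused when transforming new data.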

Numlet answered Sep 20 '22 06:09