PCA with missing values in Python

Tags: python, numpy, pca

I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?

Thanks.

asked Apr 02 '15 19:04

Emily

2 Answers

Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a probabilistic PCA (PPCA) algorithm, which gives the same result as PCA on complete data, and which some implementations extend to handle missing data more robustly.

I have found two libraries:

Package PPCA on PyPI, which is called PCA-magic on GitHub
Package PyPPCA, which has the same name on PyPI and GitHub

Since both packages are only lightly maintained, you might want to implement PPCA yourself instead. The packages above build on the theory presented in the well-cited (and well-written!) paper by Tipping and Bishop (1999). It is available on Tipping's home page if you want guidance on how to implement PPCA properly.
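For anyone who does want to roll their own, here is a minimal sketch of a simpler relative of that idea: iterative low-rank imputation, which alternates between a truncated SVD fit and re-filling the missing entries from the reconstruction. This is not the Tipping and Bishop EM algorithm and not the code from either package above (there is no noise-variance term). The function name `iterative_pca_impute` and its parameters are made up for illustration, and it assumes missing values are encoded as NaN and that no column is entirely missing.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Fill NaNs by alternating a rank-k SVD fit with re-imputation.

    Hypothetical helper for illustration; not part of any library.
    Returns the completed matrix, the principal axes, and the scores.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from the column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    # Skip the loop entirely when nothing is missing.
    for _ in range(n_iter if missing.any() else 0):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-k reconstruction of the current completed matrix.
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        change = np.max(np.abs(recon[missing] - filled[missing]))
        # Only the missing cells are overwritten; observed data stay fixed.
        filled[missing] = recon[missing]
        if change < tol:
            break

    # Ordinary PCA (via SVD) on the completed matrix.
    mu = filled.mean(axis=0)
    _, _, Vt = np.linalg.svd(filled - mu, full_matrices=False)
    components = Vt[:n_components]
    scores = (filled - mu) @ components.T
    return filled, components, scores
```

For a floating-point masked array `arr`, something like `iterative_pca_impute(arr.filled(np.nan), n_components=2)` should work, since `filled` turns the masked entries into NaN. Treat this as a starting point only; a proper PPCA implementation also estimates the noise variance.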

As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but the developers have chosen not to implement it in a way that handles missing values.

EDIT: Both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and fixed the bugs. The fork is available on GitHub.

answered Sep 19 '22 06:09

LudvigH

I think you will probably need to do some preprocessing of the data before doing PCA. You can use:

sklearn.impute.SimpleImputer

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer

With this class you can automatically replace the missing values with the mean, median, or most frequent value of each column. Which of these options is best is hard to say; it depends on many factors, such as what the data look like.
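As a rough sketch of how that could look with a masked array like the one in the question: `SimpleImputer` looks for NaN by default, so the masked entries are converted to NaN first. The numbers below are made up purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A tiny masked array standing in for the real data (values are invented).
raw = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [5.0, np.nan],
                [7.0, 8.0]])
data = np.ma.masked_invalid(raw)

# Turn masked entries into NaN, which is what SimpleImputer treats as missing.
X = data.filled(np.nan)

# Each strategy fills a gap using the observed values in the same column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
```

Comparing `X_mean` and `X_median` on your own data is a cheap way to see how sensitive the downstream PCA is to the choice of strategy.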

By the way, the same library also provides PCA:

sklearn.decomposition.PCA

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

It also offers many other statistical functions and machine learning techniques.
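Putting the two pieces together, a minimal sketch could chain the imputer and PCA in a pipeline. The random data, the 10% missing rate, and the choice of two components are all assumptions for the sake of the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the entries knocked out (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.1] = np.nan

# Impute column means first, then project onto the first two components.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=2))
scores = pipeline.fit_transform(X)

# make_pipeline names each step after its lowercased class name.
explained = pipeline.named_steps["pca"].explained_variance_ratio_
```

Keep in mind the caveat from the other answer: mean imputation shrinks the variance of the imputed columns, so the explained-variance figures should be read with some caution.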

answered Sep 20 '22 06:09

Numlet
