Matrix completion in Python

Question

Say I have a matrix:

> import numpy as nap
> a = np.random.random((5,5))

array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])

And that I punch some holes in it with np.NaN, e.g.:

> a[(1,4,0,3),(2,4,2,0)] = np.NaN; 

array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,        nan ]])

I would like to fill-in the nan entries using information from the rest of entries of the matrix. An example would be using the average value of the column where the nan entries occur.

More generally, are there any libraries in Python for matrix completion ? (e.g. something along the lines of Candes & Recht's convex optimization method).

Background:

This problem appears often in machine learning. For example when working with missing features in classification/regression or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia and here)

Fred Foo · Accepted Answer

If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:

>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361,  0.62987193,         nan,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,         nan,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [        nan,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,         nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361,  0.62987193,  0.24346087,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,  0.24346087,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [ 0.51259188,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,  0.30317394]])

After this, you can use imp.transform to do the same transformation to other data, using the mean that imp learned from a. Imputers tie into scikit-learn Pipeline objects so you can use them in classification or regression pipelines.

If you want to wait for a stable release, then 0.14 should be out next week.

Full disclosure: I'm a scikit-learn core developer

Daniel · Answer

You can do it with pure numpy, but its nastier.

from scipy.stats import nanmean
>>> a
array([[ 0.70309466,  0.53785006,         nan,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,         nan,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [        nan,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,         nan]])


>>> mean=nanmean(a,axis=0)
>>> mean
array([ 0.42092677,  0.52158153,  0.56239323,  0.58094958,  0.41881841])
>>> index=np.where(np.isnan(a))

>>> a[index]=np.take(mean,index[1])
>>> a
array([[ 0.70309466,  0.53785006,  0.56239323,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,  0.56239323,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [ 0.42092677,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,  0.41881841]])

Running some timings:

import time
import numpy as np
import pandas as pd
from scipy.stats import nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)


%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 1.84 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop

#Surprisingly, issue could be apply looping over the zero axis.
DF=pd.DataFrame(a2)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop

I do not believe numpy has array completion routines built in; however, pandas does. View the help topic here.

Matrix completion in Python

Tags:

python

machine-learning

numpy

mathematical-optimization

scikit-learn

Background:

Amelio Vazquez-Reina

2 Answers

Fred Foo

Daniel

Recent Activity

Donate For Us

Matrix completion in Python

Tags:

python

machine-learning

numpy

mathematical-optimization

scikit-learn

Background:

Amelio Vazquez-Reina

2 Answers

Fred Foo

Daniel

Related questions

Recent Activity

Donate For Us