Say I have a matrix:
> import numpy as np
> a = np.random.random((5,5))
array([[ 0.28164485, 0.76200749, 0.59324211, 0.15201506, 0.74084168],
[ 0.83572213, 0.63735993, 0.28039542, 0.19191284, 0.48419414],
[ 0.99967476, 0.8029097 , 0.53140614, 0.24026153, 0.94805153],
[ 0.92478 , 0.43488547, 0.76320656, 0.39969956, 0.46490674],
[ 0.83315135, 0.94781119, 0.80455425, 0.46291229, 0.70498372]])
And that I punch some holes in it with np.NaN, e.g.:
> a[(1,4,0,3),(2,4,2,0)] = np.NaN;
array([[ 0.80327707, 0.87722234, nan, 0.94463778, 0.78089194],
[ 0.90584284, 0.18348667, nan, 0.82401826, 0.42947815],
[ 0.05913957, 0.15512961, 0.08328608, 0.97636309, 0.84573433],
[ nan, 0.30120861, 0.46829231, 0.52358888, 0.89510461],
[ 0.19877877, 0.99423591, 0.17236892, 0.88059185, nan ]])
I would like to fill in the nan entries using information from the remaining entries of the matrix. An example would be using the average value of the column where the nan entries occur.
More generally, are there any libraries in Python for matrix completion (e.g. something along the lines of Candes & Recht's convex optimization method)?
This problem appears often in machine learning, for example when working with missing features in classification/regression or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia).
If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361, 0.62987193, nan, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, nan, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ nan, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361, 0.62987193, 0.24346087, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, 0.24346087, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ 0.51259188, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, 0.30317394]])
After this, you can use imp.transform to apply the same transformation to other data, using the mean that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.
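To illustrate reusing the learned means and the Pipeline integration, here is a minimal sketch. It assumes a later scikit-learn release (0.20 or newer), where the Imputer class was replaced by SimpleImputer in sklearn.impute. The toy arrays and the choice of LinearRegression are made up for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy training data with holes (illustrative values only).
X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

# Learn the column means from the training data, ignoring NaN entries.
imp = SimpleImputer(strategy="mean")
imp.fit(X_train)

# Fill holes in new data with the means learned from X_train,
# not with the means of X_new itself.
X_new = np.array([[np.nan, 10.0],
                  [7.0, np.nan]])
X_new_filled = imp.transform(X_new)

# The same imputation step as the first stage of a regression pipeline.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X_train, y_train)
pred = model.predict(X_new)
```

Keeping the imputer inside the pipeline means the column means are always estimated from the training data only, which matters when cross-validating.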
If you want to wait for a stable release, then 0.14 should be out next week.
Full disclosure: I'm a scikit-learn core developer.
You can do it with pure numpy, but it's nastier.
>>> from scipy.stats import nanmean
>>> a
array([[ 0.70309466, 0.53785006, nan, 0.49590115, 0.23521493],
[ 0.29067786, 0.48236186, nan, 0.93220001, 0.76261019],
[ 0.66243065, 0.07731947, 0.38887545, 0.56450533, 0.58647126],
[ nan, 0.7870873 , 0.60010096, 0.88778259, 0.09097726],
[ 0.02750389, 0.72328898, 0.69820328, 0.02435883, nan]])
>>> mean=nanmean(a,axis=0)
>>> mean
array([ 0.42092677, 0.52158153, 0.56239323, 0.58094958, 0.41881841])
>>> index=np.where(np.isnan(a))
>>> a[index]=np.take(mean,index[1])
>>> a
array([[ 0.70309466, 0.53785006, 0.56239323, 0.49590115, 0.23521493],
[ 0.29067786, 0.48236186, 0.56239323, 0.93220001, 0.76261019],
[ 0.66243065, 0.07731947, 0.38887545, 0.56450533, 0.58647126],
[ 0.42092677, 0.7870873 , 0.60010096, 0.88778259, 0.09097726],
[ 0.02750389, 0.72328898, 0.69820328, 0.02435883, 0.41881841]])
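Recent SciPy releases no longer ship scipy.stats.nanmean, so on a current install the same idea can be written with np.nanmean (available since NumPy 1.8). The following is a sketch of the same column-mean fill; the helper name fill_nan_with_column_mean and the small input array are made up for illustration.

```python
import numpy as np

def fill_nan_with_column_mean(a):
    """Return a copy of `a` with each NaN replaced by the mean of its column."""
    a = np.array(a, dtype=float)           # work on a copy, leave the input alone
    col_means = np.nanmean(a, axis=0)      # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(a))     # positions of the holes
    a[rows, cols] = col_means[cols]        # look up the mean of each hole's column
    return a

# Small illustrative input with one hole per column.
a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, 7.0, 9.0]])
filled = fill_nan_with_column_mean(a)
print(filled)
```

Note that a column consisting only of NaN has no mean to fall back on: np.nanmean emits a RuntimeWarning and the holes in that column stay NaN.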
Running some timings:
import time
import numpy as np
import pandas as pd
from scipy.stats import nanmean
a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)
%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 1.84 s per loop
%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop
# Surprisingly slow; the issue could be apply looping over axis 0.
DF=pd.DataFrame(a1)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop
I do not believe numpy has array completion routines built in; however, pandas does. See the pandas documentation on working with missing data.
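As a minimal sketch of the pandas route, assuming the goal is the column-mean fill from the question: DataFrame.mean skips NaN by default, and passing the resulting Series to fillna fills each column with its own mean. The small frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one hole per column.
df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan],
                   [np.nan, 7.0, 9.0]])

# df.mean() gives one NaN-skipping mean per column; fillna matches the
# Series index against the column labels and fills column by column.
filled = df.fillna(df.mean())
print(filled.to_numpy())
```

fillna returns a new DataFrame here, so df itself still contains the holes.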