
How to convert a pandas DataFrame subset of columns AND rows into a numpy array?

I'm wondering if there is a simpler, more memory-efficient way to select a subset of rows and columns from a pandas DataFrame.

For instance, given this dataframe:

    df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
    print df
              a         b         c         d         e
    0  0.945686  0.000710  0.909158  0.892892  0.326670
    1  0.919359  0.667057  0.462478  0.008204  0.473096
    2  0.976163  0.621712  0.208423  0.980471  0.048334
    3  0.459039  0.788318  0.309892  0.100539  0.753992

I want only those rows in which the value for column 'c' is greater than 0.5, but I only need columns 'b' and 'e' for those rows.

This is the method that I've come up with - perhaps there is a better "pandas" way?

    locs = [df.columns.get_loc(_) for _ in ['a', 'd']]
    print df[df.c > 0.5][locs]
              a         d
    0  0.945686  0.892892

My final goal is to convert the result to a numpy array to pass into an sklearn regression algorithm, so I will use the code above like this:

    training_set = array(df[df.c > 0.5][locs])

... and that peeves me since I end up with a huge array copy in memory. Perhaps there's a better way for that too?

John Prior asked Jul 16 '13 16:07


People also ask

How do I convert a column of a pandas Dataframe into a NumPy array?

You can convert selected columns of a DataFrame into a NumPy array by selecting the column subset and calling its to_numpy() method.
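As a rough illustration of that answer, here is a minimal sketch. The small frame and the names `subset` and `single` are made up for the example, and `to_numpy()` assumes a reasonably recent pandas (it was added in 0.24; older code uses `.values`).

```python
import pandas as pd

# Hypothetical example frame; the values are chosen only for illustration.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# A list of column names selects a sub-DataFrame, which becomes a 2-D array.
subset = df[["a", "c"]].to_numpy()

# A single column name selects a Series, which becomes a 1-D array.
single = df["b"].to_numpy()
```

Note the difference in shape: a list of names gives a two-dimensional result even when the list has one element, while a bare name gives a one-dimensional one.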

Can we convert pandas Dataframe to NumPy array?

Yes. A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns), and it can be converted to a NumPy ndarray with the DataFrame.to_numpy() method.

How do you convert rows into columns and columns into rows in pandas?

The transpose() function (also available as the .T attribute) swaps the index and the columns, reflecting the DataFrame over its main diagonal so that rows become columns and vice versa. Its copy parameter controls copying: if True, the underlying data is copied; otherwise (the default), no copy is made if possible.
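A small sketch of what transposing does to the labels, using a made-up two-by-two frame (the labels `r0`, `r1`, `x`, `y` are assumptions for the example):

```python
import pandas as pd

# Hypothetical frame with explicit row labels so the swap is easy to see.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])

# .T is shorthand for df.transpose(): row labels become column labels
# and column labels become row labels.
transposed = df.T
```

After the call, the value stored at row `r0`, column `y` in `df` is found at row `y`, column `r0` in `transposed`.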

Is a pandas Dataframe the same as a NumPy array?

No. Series are similar to one-dimensional NumPy arrays, with a single dtype, but with an additional index (a list of row labels). DataFrames are an ordered sequence of Series that share the same index, with labeled columns.


2 Answers

Use the .values attribute directly:

    In [79]: df[df.c > 0.5][['b', 'e']].values
    Out[79]:
    array([[ 0.98836259,  0.82403141],
           [ 0.337358  ,  0.02054435],
           [ 0.29271728,  0.37813099],
           [ 0.70033513,  0.69919695]])
waitingkuo answered Oct 07 '22 06:10


For the first problem, you can simply select the columns by their names, perhaps like this:

    >>> df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
    >>> df[df['c']>.5][['b','e']]
              b         e
    1  0.071146  0.132145
    2  0.495152  0.420219

For the second problem:

    >>> df[df['c']>.5][['b','e']].values
    array([[ 0.07114556,  0.13214495],
           [ 0.49515157,  0.42021946]])
Daniel answered Oct 07 '22 06:10
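The row filter and the column selection used in the answers can also be written as one `.loc` call rather than two chained `[]` lookups. The sketch below is not taken from either answer. The seeded generator and the names `mask`, `training_set` and `chained` are assumptions added for the example. Whether the single call actually reduces intermediate copies depends on pandas internals, so it is worth measuring on real data before relying on it.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption, so that the example is reproducible).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))

# Boolean mask for the rows of interest.
mask = df["c"] > 0.5

# Rows and columns selected in a single indexing expression, then converted.
training_set = df.loc[mask, ["b", "e"]].to_numpy()

# The chained form from the answers, kept here only for comparison.
chained = df[df["c"] > 0.5][["b", "e"]].values
```

Both forms give the same array. The `.loc` version avoids chained indexing, which is the form pandas generally recommends, and a boolean row filter returns new data either way, so some copying is unavoidable.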