
pandas: fill a column with some numpy arrays

Tags: python, pandas

I am using Python 2.7 and pandas 0.11.0.

I am trying to fill a column of a DataFrame using DataFrame.apply(func). The func() function is supposed to return a NumPy array (1x3) for each row.

import pandas as pd
import numpy as np

df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
print(df)

              A         B         C
    0  0.910142  0.788300  0.114164
    1 -0.603282 -0.625895  2.843130
    2  1.823752 -0.091736 -0.107781
    3  0.447743 -0.163605  0.514052

The function used for testing purposes:

def test(row):
   # some complex calc here 
   # based on the values from different columns 
   return np.array((1,2,3))

df['D'] = df.apply(test, axis=1)

[...]
ValueError: Wrong number of items passed 1, indices imply 3

The funny thing is that when I create the DataFrame from scratch with the arrays already in place, it works fine and gives the expected result:

dic = {'A': {0: 0.9, 1: -0.6, 2: 1.8, 3: 0.4}, 
     'C': {0: 0.1, 1: 2.8, 2: -0.1, 3: 0.5}, 
     'B': {0: 0.7, 1: -0.6, 2: -0.1, 3: -0.1},
     'D': {0:np.array((1,2,3)), 
          1:np.array((1,2,3)), 
          2:np.array((1,2,3)), 
          3:np.array((1,2,3))}}

df= pd.DataFrame(dic)
print(df)
         A    B    C          D
    0  0.9  0.7  0.1  [1, 2, 3]
    1 -0.6 -0.6  2.8  [1, 2, 3]
    2  1.8 -0.1 -0.1  [1, 2, 3]
    3  0.4 -0.1  0.5  [1, 2, 3]

Thanks in advance

Nic asked Sep 05 '13 16:09



1 Answer

If the function passed to apply returns multiple values, and the number of values returned equals the number of items the DataFrame has along the axis (in this case, the columns), pandas builds a DataFrame from the return values, with the same labels as the original DataFrame. You can see this if you just do:

>>> def test(row):
...     return [1, 2, 3]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3

That is why you get the error: a DataFrame with three columns cannot be assigned to a single DataFrame column.

If you return any other number of values, apply returns a plain Series object, which can be assigned:

>>> def test(row):
...     return [1, 2]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C       D
0  0.333535  0.209745 -0.972413  [1, 2]
1  0.469590  0.107491 -1.248670  [1, 2]
2  0.234444  0.093290 -0.853348  [1, 2]
3  1.021356  0.092704 -0.406727  [1, 2]

I'm not sure why pandas does this, or why it does it only when the return value is a list or an ndarray; it does not happen if you return a tuple:

>>> def test(row):
...     return (1, 2, 3)
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C          D
0  0.121136  0.541198 -0.281972  (1, 2, 3)
1  0.569091  0.944344  0.861057  (1, 2, 3)
2 -1.742484 -0.077317  0.181656  (1, 2, 3)
3 -1.541244  0.174428  0.660123  (1, 2, 3)

Viktor Kerkez answered Oct 02 '22 22:10