pandas: fill a column with some numpy arrays

Tags:

pandas

I am using Python 2.7 and pandas 0.11.0.

I am trying to fill a column of a DataFrame using DataFrame.apply(func). The function func() is supposed to return a numpy array (1x3).

import pandas as pd
import numpy as np

df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
print(df)

              A         B         C
    0  0.910142  0.788300  0.114164
    1 -0.603282 -0.625895  2.843130
    2  1.823752 -0.091736 -0.107781
    3  0.447743 -0.163605  0.514052

The function used for testing purposes:

def test(row):
   # some complex calc here 
   # based on the values from different columns 
   return np.array((1,2,3))

df['D'] = df.apply(test, axis=1)

[...]
ValueError: Wrong number of items passed 1, indices imply 3

The funny thing is that when I create the DataFrame from scratch, with the arrays already in place, it works fine and returns what I expect:

dic = {'A': {0: 0.9, 1: -0.6, 2: 1.8, 3: 0.4}, 
     'C': {0: 0.1, 1: 2.8, 2: -0.1, 3: 0.5}, 
     'B': {0: 0.7, 1: -0.6, 2: -0.1, 3: -0.1},
     'D': {0:np.array((1,2,3)), 
          1:np.array((1,2,3)), 
          2:np.array((1,2,3)), 
          3:np.array((1,2,3))}}

df= pd.DataFrame(dic)
print(df)
         A    B    C          D
    0  0.9  0.7  0.1  [1, 2, 3]
    1 -0.6 -0.6  2.8  [1, 2, 3]
    2  1.8 -0.1 -0.1  [1, 2, 3]
    3  0.4 -0.1  0.5  [1, 2, 3]

Thanks in advance

asked Sep 05 '13 16:09

1 Answer

If the function passed to apply returns multiple values, and the number of values equals the number of items along the axis of the DataFrame you call apply on (in this case, the columns), pandas builds a DataFrame from the return values, using the same labels as the original DataFrame. You can see this if you just do:

>>> def test(row):
...     return [1, 2, 3]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3

That is why you get the error: you cannot assign a DataFrame to a single DataFrame column.

If you return any other number of values, apply returns a plain Series, which can be assigned:

>>> def test(row):
...     return [1, 2]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C       D
0  0.333535  0.209745 -0.972413  [1, 2]
1  0.469590  0.107491 -1.248670  [1, 2]
2  0.234444  0.093290 -0.853348  [1, 2]
3  1.021356  0.092704 -0.406727  [1, 2]

I'm not sure why pandas does this, or why it does so only when the return value is a list or an ndarray; it does not happen if you return a tuple:

>>> def test(row):
...     return (1, 2, 3)
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
>>> df
          A         B         C          D
0  0.121136  0.541198 -0.281972  (1, 2, 3)
1  0.569091  0.944344  0.861057  (1, 2, 3)
2 -1.742484 -0.077317  0.181656  (1, 2, 3)
3 -1.541244  0.174428  0.660123  (1, 2, 3)

answered Oct 02 '22 22:10

Viktor Kerkez

Nic
