
Apply function to each row of pandas dataframe to create two new columns

Tags: python, pandas

I have a pandas DataFrame, st, containing multiple columns:

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
    Data columns:
    Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
    Julian_Day                          53732  non-null values
    AOT_1020                            53716  non-null values
    AOT_870                             53732  non-null values
    AOT_675                             53188  non-null values
    AOT_500                             51687  non-null values
    AOT_440                             53727  non-null values
    AOT_380                             51864  non-null values
    AOT_340                             52852  non-null values
    Water(cm)                           51687  non-null values
    %TripletVar_1020                    53710  non-null values
    %TripletVar_870                     53726  non-null values
    %TripletVar_675                     53182  non-null values
    %TripletVar_500                     51683  non-null values
    %TripletVar_440                     53721  non-null values
    %TripletVar_380                     51860  non-null values
    %TripletVar_340                     52846  non-null values
    440-870Angstrom                     53732  non-null values
    380-500Angstrom                     52253  non-null values
    440-675Angstrom                     53732  non-null values
    500-870Angstrom                     53732  non-null values
    340-440Angstrom                     53277  non-null values
    Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
    Solar_Zenith_Angle                  53732  non-null values
    dtypes: datetime64[ns](1), float64(22), object(1)

I want to create two new columns for this dataframe by applying a function to each row. I don't want to call the function multiple times (e.g. by doing two separate apply calls), as it is rather computationally intensive. I have tried doing this in two ways, and neither of them works:

Using apply:

I have written a function which takes a Series and returns a tuple of the values I want:

    def calculate(s):
        a = s['path'] + 2*s['row'] # Simple calc for example
        b = s['path'] * 0.153
        return (a, b)

Trying to apply this to the DataFrame gives an error:

    st.apply(calculate, axis=1)
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-248-acb7a44054a7> in <module>()
    ----> 1 st.apply(calculate, axis=1)

    C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
       4191                     return self._apply_raw(f, axis)
       4192                 else:
    -> 4193                     return self._apply_standard(f, axis)
       4194             else:
       4195                 return self._apply_broadcast(f, axis)

    C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures)
       4274                 index = None
       4275
    -> 4276             result = self._constructor(data=results, index=index)
       4277             result.rename(columns=dict(zip(range(len(res_index)), res_index)),
       4278                           inplace=True)

    C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
        390             mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
        391         elif isinstance(data, dict):
    --> 392             mgr = self._init_dict(data, index, columns, dtype=dtype)
        393         elif isinstance(data, ma.MaskedArray):
        394             mask = ma.getmaskarray(data)

    C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
        521
        522         return _arrays_to_mgr(arrays, data_names, index, columns,
    --> 523                               dtype=dtype)
        524
        525     def _init_ndarray(self, values, index, columns, dtype=None,

    C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
       5411
       5412     # consolidate for now
    -> 5413     mgr = BlockManager(blocks, axes)
       5414     return mgr.consolidate()
       5415

    C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check)
        802
        803         if do_integrity_check:
    --> 804             self._verify_integrity()
        805
        806         self._consolidate_check()

    C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self)
        892                                      "items")
        893             if block.values.shape[1:] != mgr_shape[1:]:
    --> 894                 raise AssertionError('Block shape incompatible with manager')
        895         tot_items = sum(len(x.items) for x in self.blocks)
        896         if len(self.items) != tot_items:

    AssertionError: Block shape incompatible with manager

I was then going to assign the values returned from apply to two new columns using the method shown in this question. However, I can't even get to this point! This all works fine if I just return one value.

Using a loop:

I first created two new columns of the dataframe and set them to None:

    st['a'] = None
    st['b'] = None

I then looped over all of the indices and tried to overwrite these None values, but the assignments had no effect. No error was raised, yet the DataFrame was not modified.

    for i in st.index:
        # do calc here
        st.ix[i]['a'] = a
        st.ix[i]['b'] = b

I thought that both of these methods would work, but neither of them did. So, what am I doing wrong here? And what is the best, most 'pythonic' and 'pandaonic' way to do this?

347

asked Feb 27 '13 17:02

robintw

1 Answer

To make the first approach work, try returning a Series instead of a tuple. apply raises the exception because it doesn't know how to glue the per-row results back together: the number of columns doesn't match the original frame.

    import pandas as pd

    def calculate(s):
        a = s['path'] + 2*s['row'] # Simple calc for example
        b = s['path'] * 0.153
        # The keys become the column names of the frame that apply returns
        return pd.Series(dict(col1=a, col2=b))
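For completeness, here is a minimal sketch of how the Series-returning function might be used to add both columns with a single apply call. The small frame below is a made-up stand-in for st (only the 'path' and 'row' columns from the example are assumed), and joining on the index is one possible way to attach the result, not necessarily the only one.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame; only the two columns
# used by the example function are included.
st = pd.DataFrame({"path": [1.0, 2.0, 3.0], "row": [10.0, 20.0, 30.0]})

def calculate(s):
    a = s["path"] + 2 * s["row"]  # Simple calc for example
    b = s["path"] * 0.153
    return pd.Series({"col1": a, "col2": b})

# The function runs once per row; because each call returns a Series,
# apply assembles the results into a DataFrame with columns col1 and col2.
new_cols = st.apply(calculate, axis=1)

# Attach both new columns at once, aligned on the index.
st = st.join(new_cols)
print(st)
```

Because the expensive function is called only once per row, this avoids the two separate apply calls the question wanted to rule out.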

The second approach should work if you replace:

    st.ix[i]['a'] = a

with:

    st.ix[i, 'a'] = a
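The chained form fails silently because st.ix[i] can return a copy of the row, so the assignment modifies that temporary copy rather than the frame. Note also that .ix was removed in pandas 1.0; on current versions the label-based .loc indexer plays the same role. Below is a rough sketch of the loop under that assumption, using a small made-up frame with unique index labels in place of st.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame, with a unique DatetimeIndex.
st = pd.DataFrame(
    {"path": [1.0, 2.0], "row": [10.0, 20.0]},
    index=pd.to_datetime(["1993-01-07", "1993-01-08"]),
)

# Pre-create the target columns as float columns filled with NaN.
st["a"] = float("nan")
st["b"] = float("nan")

for i in st.index:
    # Same example calculation as in the question.
    a = st.loc[i, "path"] + 2 * st.loc[i, "row"]
    b = st.loc[i, "path"] * 0.153
    # A single indexing call with (row label, column label) writes into
    # the frame itself instead of into a temporary copy of the row.
    st.loc[i, "a"] = a
    st.loc[i, "b"] = b

print(st)
```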

125

answered Sep 21 '22 15:09

Garrett
