How to create lazy_evaluated dataframe columns in Pandas

Tags:

A lot of times, I have a big dataframe df to hold the basic data, and need to create many more columns to hold the derivative data calculated by basic data columns.

I can do that in Pandas like:

df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
....
df['derivative_coln'] = func(list_of_basic_cols)

etc. Pandas will calculate and allocate the memory for all derivative columns all at once.

What I want now is to have a lazy evaluation mechanism to postpone the calculation and memory allocation of derivative columns to the actual need moment. Somewhat define the lazy_eval_columns as:

df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])

That will save the time/memory like Python 'yield' generator, for if I issue df['derivative_col2'] command will only triger the specific calculation and memory allocation.

So how to do lazy_eval() in Pandas ? Any tip/thought/ref are welcome.

751

asked Oct 26 '13 10:10

bigbug

2 Answers

Starting in 0.13 (releasing very soon), you can do something like this. This is using generators to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13, see here

In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])

In [20]: df
Out[20]: 
          a         b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2  0.202116  0.094344
3 -1.225579 -0.447545
4  1.739508 -0.400829

In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]

Create a generator that evaluates a formula using eval; assigns the result, then yields the result.

In [22]: def lazy(x, formulas):
   ....:     for col, f in formulas:
   ....:         x[col] = x.eval(f)
   ....:         yield x
   ....:

In action

In [23]: gen = lazy(df,formulas)

In [24]: gen.next()
Out[24]: 
          a         b         c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2  0.202116  0.094344  0.296459
3 -1.225579 -0.447545 -1.673123
4  1.739508 -0.400829  1.338679

In [25]: gen.next()
Out[25]: 
          a         b         c         d
0 -1.949107 -0.763762 -2.712869  5.287670
1 -0.382173 -0.970349 -1.352522  0.516897
2  0.202116  0.094344  0.296459  0.059919
3 -1.225579 -0.447545 -1.673123  2.050545
4  1.739508 -0.400829  1.338679  2.328644

So its user determined ordering for the evaluation (and not on-demand). In theory numba is going to support this, so pandas possibly support this as a backend for eval (which currently uses numexpr for immediate evaluation).

my 2c.

lazy evaluation is nice, but can easily be achived by using python's own continuation/generate features, so building it into pandas, while possible, is quite tricky, and would need a really nice usecase to be generally useful.

159

answered Oct 03 '22 22:10

Jeff

You could subclass DataFrame, and add the column as a property. For example,

import pandas as pd

class LazyFrame(pd.DataFrame):
    @property
    def derivative_col1(self):
        self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
        return result

x = LazyFrame({'basic_col1':[1,2,3],
               'basic_col2':[4,5,6]})
print(x)
#    basic_col1  basic_col2
# 0           1           4
# 1           2           5
# 2           3           6

Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:

print(x.derivative_col1)
# 0    5
# 1    7
# 2    9

print(x)
#    basic_col1  basic_col2  derivative_col1
# 0           1           4                5
# 1           2           5                7
# 2           3           6                9

Note that if you modify a basic column:

x['basic_col1'] *= 10

the derived column is not automatically updated:

print(x['derivative_col1'])
# 0    5
# 1    7
# 2    9

But if you access the property, the values are recomputed:

print(x.derivative_col1)
# 0    14
# 1    25
# 2    36

print(x)
#    basic_col1  basic_col2  derivative_col1
# 0          10           4               14
# 1          20           5               25
# 2          30           6               36

answered Oct 03 '22 22:10

unutbu

Related questions
                            
                                Sandboxing in Linux
                            
                                Python: multiple calls to __init__() on the same instance
                            
                                Reason for low Pylint ratings of Python standard library code
                            
                                Change working directory in shell with a python script
                            
                                Prototypal programming in Python
                            
                                Detect last iteration over dictionary.iteritems() in python
                            
                                How is it possible to use raw_input() in a Python Git hook?
                            
                                How can I get generators/iterators to evaluate as False when exhausted?
                            
                                How can I get a weighted random pick from Python's Counter class?
                            
                                Any good text editor - Android app - optimised for programmers? [closed]
                            
                                How can I get value of the nested dictionary using ImmutableMultiDict on Flask?
                            
                                What does a plus sign do in front of a variable in Python?
                            
                                How do I get a "debug" variable in my Django template context?
                            
                                TypeError: zip argument #1 must support iteration
                            
                                How to delete/remove row from the google spreadsheet using gspread lib. in python?
                            
                                Python CSV Error: sequence expected
                            
                                How should I establish and manage database connections in a multi-module Python app?
                            
                                Compute a chain of functions in python
                            
                                Escaping quotes in jinja2
                            
                                How to drawImage a matplotlib figure in a reportlab canvas?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to create lazy_evaluated dataframe columns in Pandas

Tags:

python

pandas

lazy-evaluation

bigbug

People also ask

2 Answers

Jeff

unutbu

Recent Activity

Donate For Us