A lot of times, I have a big dataframe df
to hold the basic data, and need to create many more columns to hold the derivative data calculated by basic data columns.
I can do that in Pandas like:
df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
....
df['derivative_coln'] = func(list_of_basic_cols)
etc. Pandas will calculate and allocate the memory for all derivative columns all at once.
What I want now is to have a lazy evaluation mechanism to postpone the calculation and memory allocation of derivative columns to the actual need moment. Somewhat define the lazy_eval_columns as:
df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])
That will save the time/memory like Python 'yield' generator, for if I issue df['derivative_col2']
command will only triger the specific calculation and memory allocation.
So how to do lazy_eval()
in Pandas ? Any tip/thought/ref are welcome.
You can change the column type in pandas dataframe using the df. astype() method.
To create an index, from a column, in Pandas dataframe you use the set_index() method. For example, if you want the column “Year” to be index you type <code>df. set_index(“Year”)</code>. Now, the set_index() method will return the modified dataframe as a result.
One of the way to create Pandas DataFrame is by using zip() function. You can use the lists to create lists of tuples and create a dictionary from it. Then, this dictionary can be used to construct a dataframe. zip() function creates the objects and that can be used to produce single item at a time.
Starting in 0.13 (releasing very soon), you can do something like this. This is using generators to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13, see here
In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])
In [20]: df
Out[20]:
a b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2 0.202116 0.094344
3 -1.225579 -0.447545
4 1.739508 -0.400829
In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]
Create a generator that evaluates a formula using eval
; assigns the result, then yields the result.
In [22]: def lazy(x, formulas):
....: for col, f in formulas:
....: x[col] = x.eval(f)
....: yield x
....:
In action
In [23]: gen = lazy(df,formulas)
In [24]: gen.next()
Out[24]:
a b c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2 0.202116 0.094344 0.296459
3 -1.225579 -0.447545 -1.673123
4 1.739508 -0.400829 1.338679
In [25]: gen.next()
Out[25]:
a b c d
0 -1.949107 -0.763762 -2.712869 5.287670
1 -0.382173 -0.970349 -1.352522 0.516897
2 0.202116 0.094344 0.296459 0.059919
3 -1.225579 -0.447545 -1.673123 2.050545
4 1.739508 -0.400829 1.338679 2.328644
So its user determined ordering for the evaluation (and not on-demand). In theory numba
is going to support this, so pandas possibly support this as a backend for eval
(which currently uses numexpr for immediate evaluation).
my 2c.
lazy evaluation is nice, but can easily be achived by using python's own continuation/generate features, so building it into pandas, while possible, is quite tricky, and would need a really nice usecase to be generally useful.
You could subclass DataFrame
, and add the column as a property. For example,
import pandas as pd
class LazyFrame(pd.DataFrame):
@property
def derivative_col1(self):
self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
return result
x = LazyFrame({'basic_col1':[1,2,3],
'basic_col2':[4,5,6]})
print(x)
# basic_col1 basic_col2
# 0 1 4
# 1 2 5
# 2 3 6
Accessing the property (via x.derivative_col1
, below) calls the derivative_col1
function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:
print(x.derivative_col1)
# 0 5
# 1 7
# 2 9
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 1 4 5
# 1 2 5 7
# 2 3 6 9
Note that if you modify a basic column:
x['basic_col1'] *= 10
the derived column is not automatically updated:
print(x['derivative_col1'])
# 0 5
# 1 7
# 2 9
But if you access the property, the values are recomputed:
print(x.derivative_col1)
# 0 14
# 1 25
# 2 36
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 10 4 14
# 1 20 5 25
# 2 30 6 36
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With