Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorize integration of pandas.DataFrame

I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests.

How do I calculate the work done (which is "the area under the curve")?

I looked at numpy.trapz which seems to do what I need, but I think that I can avoid looping over each column like this:

import numpy as np
import pandas as pd 

forces = pd.read_csv(...)
work_done = {}

for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))

I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something might be appropriate but don't know where to start looking.

In short:

  1. Can I avoid the looping?
  2. Can I create a DataFrame of work done directly?

Thanks in advance for any help.

like image 491
blokeley Avatar asked Dec 31 '15 15:12

blokeley


1 Answers

You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

import numpy as np
import pandas as pd

# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)

# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)

# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col:np.trapz(forces.loc[:, col], forces.index)})

These give the following output:

from pprint import pprint

pprint(work_done.T)
#            0
# a -24.331560
# b -10.347663
# c   4.662212
# d -12.536040
# e -10.276861
# f   3.406740
# g  -3.712674
# h  -9.508454
# i  -1.044931
# j  15.165782

pprint(work_done2)
# {'a': -24.331559643023006,
#  'b': -10.347663159421426,
#  'c': 4.6622123535050459,
#  'd': -12.536039649161403,
#  'e': -10.276861220217308,
#  'f': 3.4067399176289994,
#  'g': -3.7126739591045541,
#  'h': -9.5084536839888187,
#  'i': -1.0449311137294459,
#  'j': 15.165781517623724}

There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.


Edit:

You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:

work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))

However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop

In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an
intermediate result is being cached 
10000 loops, best of 3: 53.4 µs per loop

This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.

like image 156
ali_m Avatar answered Oct 17 '22 22:10

ali_m