Pandas: Make a new column by linearly interpolating between existing columns

Tags: python, pandas

Say I have a DataFrame containing temperature readings taken at various altitudes on a mountain, with all probes sampled simultaneously once per day. The altitude of each probe is known and fixed (i.e. it stays constant from day to day). Each row represents a different timestamp, and there is a separate column for the temperature observed by each probe. There is also a column (targ_alt) that contains an "altitude of interest" for each row.

My goal is to add a new column called intreped_temp which contains, for each row, the temperature that you would get for that row's targ_alt by linearly interpolating between the temperatures of the probes at their known altitudes. What is the best way to do this?

Here is some setup code so we can look at the same context:

import pandas as pd
import numpy as np

np.random.seed(1)

n = 10
probe_alts = {'base': 1000, 'mid': 2000, 'peak': 3500}
# let's make the temperatures decrease at higher altitudes...just for style
# (floor division reproduces the output shown below, which came from Python 2's integer `/`)
temp_readings = {k: np.random.randn(n) + 15 - v // 300 for k, v in probe_alts.items()}
df = pd.DataFrame(temp_readings)

targ_alt = 2000 + (500 * np.random.randn(n))
df['targ_alt'] = targ_alt

So df looks like this:

        base        mid      peak     targ_alt
0  13.624345  10.462108  2.899381  1654.169624
1  11.388244   6.939859  5.144724  1801.623237
2  11.471828   8.677583  4.901591  1656.413650
3  10.927031   8.615946  4.502494  1577.397179
4  12.865408  10.133769  4.900856  1664.376935
5   9.698461   7.900109  3.316272  1993.667701
6  13.744812   8.827572  3.877110  1441.344826
7  11.238793   8.122142  3.064231  2117.207849
8  12.319039   9.042214  3.732112  2829.901089
9  11.750630   9.582815  4.530355  2371.022080
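
To make the goal concrete, here is a minimal row-by-row sketch using np.interp. It is only one straightforward way to express the calculation, not necessarily the fastest. The helper name interp_row and the lists cols and alts are introduced here for illustration. Note that np.interp needs the altitudes in increasing order and clamps to the end values if targ_alt falls outside the probe range.

```python
import numpy as np
import pandas as pd

np.random.seed(1)

# Same setup as above (floor division assumed, to match the printed table).
n = 10
probe_alts = {'base': 1000, 'mid': 2000, 'peak': 3500}
temp_readings = {k: np.random.randn(n) + 15 - v // 300 for k, v in probe_alts.items()}
df = pd.DataFrame(temp_readings)
df['targ_alt'] = 2000 + (500 * np.random.randn(n))

# Probe columns and their altitudes, listed in increasing-altitude order.
cols = ['base', 'mid', 'peak']
alts = [probe_alts[c] for c in cols]

def interp_row(row):
    # Interpolate this row's probe temperatures to this row's target altitude.
    return np.interp(row['targ_alt'], alts, row[cols].to_numpy(dtype=float))

df['intreped_temp'] = df.apply(interp_row, axis=1)
print(df)
```

Each value in intreped_temp lies between the temperatures of the two probes that bracket that row's targ_alt.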
8one6 asked Dec 29 '13 04:12


1 Answer

In the question above, I wanted to interpolate to a different x-coordinate within each row. If you instead want to interpolate to the same x-coordinate in every row, SciPy's interp1d is dramatically faster than a row-wise apply, and the gap widens as the number of rows grows. See example below:

import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

np.random.seed(1)
n = 10**4  # must be an int for np.random.randn; the timings below use n = 1e2, 1e4 and 1e6

df = pd.DataFrame({'a': np.random.randn(n), 
                   'b': 10 + np.random.randn(n), 
                   'c': 30 + np.random.randn(n)})

xs = [-10, 0, 10]
cvs = df.columns.values

Now consider 3 different ways to tack on a column which will interpolate between the given columns to an x-coordinate of 5:

%timeit df['n1'] = df.apply(lambda row: np.interp(5, xs, row[cvs]), axis=1)
%timeit df['n2'] = df.apply(lambda row: np.interp(5, xs, tuple([row[j] for j in cvs])), axis=1)
%timeit df['n3'] = interp1d(xs, df[cvs])(5)

Here are the results for n=1e2:

100 loops, best of 3: 13.2 ms per loop
1000 loops, best of 3: 1.24 ms per loop
1000 loops, best of 3: 488 µs per loop

And for n=1e4:

1 loops, best of 3: 1.33 s per loop
10 loops, best of 3: 109 ms per loop
1000 loops, best of 3: 798 µs per loop

And for n=1e6:

# first one is too slow to wait for
1 loops, best of 3: 10.9 s per loop
10 loops, best of 3: 58.3 ms per loop

One followup question: is there a fast way to modify this code so that it could handle x inputs outside the min-max range of the training data through linear extrapolation?
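
One possible approach, sketched here under the assumption that SciPy 0.17 or later is available: interp1d accepts fill_value='extrapolate', which extends the first and last linear segments beyond the range of xs. The column name x20 and the query point 20 are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

np.random.seed(1)
n = 10**4

df = pd.DataFrame({'a': np.random.randn(n),
                   'b': 10 + np.random.randn(n),
                   'c': 30 + np.random.randn(n)})

xs = [-10, 0, 10]
cvs = df.columns.values

# Build one interpolator over all rows at once; each row of the
# (n, 3) array is treated as the y-values at xs.
# fill_value='extrapolate' (assumes SciPy >= 0.17) allows x outside [-10, 10].
f = interp1d(xs, df[cvs].to_numpy(), kind='linear', fill_value='extrapolate')

# x = 20 lies beyond the last knot, so the b-to-c segment is extended.
df['x20'] = f(20)
```

Because the last segment runs from (0, b) to (10, c), the extrapolated value at x = 20 works out to 2*c - b for each row, which gives a quick sanity check.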

8one6 answered Sep 20 '22 21:09