Say I have a DataFrame containing data about the temperature at various altitudes on a mountain, each sampled simultaneously once per day. The altitude of each probe is fixed (i.e. it stays constant from day to day) and is known. Each row represents a different timestamp, and I have a separate column to record the temperature observed by each probe. I also have a column (targ_alt) that contains an "altitude of interest" for each row.
My goal is to add a new column called intreped_temp which contains, for each row, the temperature that you would get for that row's targ_alt by linearly interpolating between the temperatures of the probes at their known altitudes. What is the best way to do this?
Here is some setup code so we can look at the same context:
import pandas as pd
import numpy as np
np.random.seed(1)
n = 10
probe_alts = {'base': 1000, 'mid': 2000, 'peak': 3500}
# let's make the temperatures decrease at higher altitudes...just for style
temp_readings = {k: np.random.randn(n) + 15 - v/300 for k, v in probe_alts.items()}
df = pd.DataFrame(temp_readings)
targ_alt = 2000 + (500 * np.random.randn(n))
df['targ_alt'] = targ_alt
So df looks like this:
base mid peak targ_alt
0 13.624345 10.462108 2.899381 1654.169624
1 11.388244 6.939859 5.144724 1801.623237
2 11.471828 8.677583 4.901591 1656.413650
3 10.927031 8.615946 4.502494 1577.397179
4 12.865408 10.133769 4.900856 1664.376935
5 9.698461 7.900109 3.316272 1993.667701
6 13.744812 8.827572 3.877110 1441.344826
7 11.238793 8.122142 3.064231 2117.207849
8 12.319039 9.042214 3.732112 2829.901089
9 11.750630 9.582815 4.530355 2371.022080
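One straightforward way to get that column is a row-wise np.interp call. This is only a minimal sketch, not necessarily the fastest option: the names probe_cols and alts are introduced here for illustration, and it assumes the probe altitudes are listed in increasing order (np.interp requires that). Note that np.interp clamps to the end values for any targ_alt outside the 1000-3500 range rather than extrapolating.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
np.random.seed(1)
n = 10
probe_alts = {'base': 1000, 'mid': 2000, 'peak': 3500}
temp_readings = {k: np.random.randn(n) + 15 - v / 300 for k, v in probe_alts.items()}
df = pd.DataFrame(temp_readings)
df['targ_alt'] = 2000 + (500 * np.random.randn(n))

# Hypothetical helper names: probe columns and their (increasing) altitudes
probe_cols = list(probe_alts)
alts = [probe_alts[c] for c in probe_cols]

# For each row, interpolate the probe temperatures at that row's targ_alt
df['intreped_temp'] = df.apply(
    lambda row: np.interp(row['targ_alt'], alts,
                          row[probe_cols].to_numpy(dtype=float)),
    axis=1,
)
print(df)
```

Because apply with axis=1 calls Python once per row, this is fine for small frames but will get slow for very large ones.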
In the example I gave above, I wanted to interpolate to a different x-coordinate within each row. If instead you want to interpolate to the same x-coordinate in every row, there are huge time savings to be had by using SciPy. See the example below:
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
np.random.seed(1)
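n = 10**4  # must be an int (np.random.randn rejects a float such as 10e4); the timings further down also use 10**2 and 10**6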
df = pd.DataFrame({'a': np.random.randn(n),
'b': 10 + np.random.randn(n),
'c': 30 + np.random.randn(n)})
xs = [-10, 0, 10]
cvs = df.columns.values
Now consider 3 different ways to tack on a column which will interpolate between the given columns to an x-coordinate of 5:
%timeit df['n1'] = df.apply(lambda row: np.interp(5, xs, row[cvs]), axis=1)
%timeit df['n2'] = df.apply(lambda row: np.interp(5, xs, tuple([row[j] for j in cvs])), axis=1)
%timeit df['n3'] = interp1d(xs, df[cvs])(5)
Here are the results for n=1e2:
100 loops, best of 3: 13.2 ms per loop
1000 loops, best of 3: 1.24 ms per loop
1000 loops, best of 3: 488 µs per loop
And for n=1e4:
1 loops, best of 3: 1.33 s per loop
10 loops, best of 3: 109 ms per loop
1000 loops, best of 3: 798 µs per loop
And for n=1e6:
# first one is too slow to wait for
1 loops, best of 3: 10.9 s per loop
10 loops, best of 3: 58.3 ms per loop
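Before trusting the speed-up, it is worth checking that the row-wise and vectorized versions produce the same numbers. The sketch below does that on a small frame; the variable names cols, slow and fast are mine, and the closed-form line relies on the fact that x = 5 lies exactly halfway between the knots 0 and 10, so the result should be the mean of columns b and c.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

np.random.seed(1)
n = 200  # small on purpose: the apply version is slow
df = pd.DataFrame({'a': np.random.randn(n),
                   'b': 10 + np.random.randn(n),
                   'c': 30 + np.random.randn(n)})
xs = [-10, 0, 10]
cols = ['a', 'b', 'c']

# Row-by-row interpolation (one Python call per row)
slow = df.apply(lambda row: np.interp(5, xs, row[cols].to_numpy(dtype=float)), axis=1)

# Vectorized: interp1d interpolates along the last axis of the (n, 3) array
fast = interp1d(xs, df[cols].to_numpy())(5)

# x = 5 is the midpoint of [0, 10], so this is just the mean of b and c
closed_form = 0.5 * df['b'] + 0.5 * df['c']

print(np.allclose(slow, fast), np.allclose(fast, closed_form))
```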
One followup question: is there a fast way to modify this code so that it could handle x inputs outside the min-max range of the training data through linear extrapolation?
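One possible approach, as far as I know, is interp1d's fill_value="extrapolate" option, which extends the first and last linear segments beyond the known x range (available in reasonably recent SciPy versions). A minimal sketch, where the column names x15 and xm20 are made up for the example:

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

np.random.seed(1)
n = 1000
df = pd.DataFrame({'a': np.random.randn(n),
                   'b': 10 + np.random.randn(n),
                   'c': 30 + np.random.randn(n)})
xs = [-10, 0, 10]
cols = ['a', 'b', 'c']

# fill_value="extrapolate" continues the end segments linearly
# instead of raising an error for x outside [-10, 10]
f = interp1d(xs, df[cols].to_numpy(), fill_value="extrapolate")

df['x15'] = f(15)    # beyond the right edge: extends the b -> c segment
df['xm20'] = f(-20)  # beyond the left edge: extends the a -> b segment
print(df.head())
```

For x = 15 the result should equal c + 0.5 * (c - b), and for x = -20 it should equal a - (b - a), which is an easy way to sanity-check the output. The same interpolator still handles in-range x values as before, so the timing characteristics should be similar, though I have not benchmarked it.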