Consider the following example, in which we set up a sample dataset, create a MultiIndex, unstack the dataframe, and then run a linear interpolation that fills row by row:
import pandas as pd  # version 0.14.1
import numpy as np  # version 1.8.1
df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   'year': list(range(2000, 2005)) * 2,  # list() so this also runs on Python 3
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()
df = df.interpolate(method='linear', axis=1)
The unstacked dataset (before interpolation) looks like this:
                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1   NaN     3   NaN
oaks   a           NaN     5   NaN   NaN     2
Since this is an interpolation method, I expect the output:
                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2
but instead the method yields (note the extrapolated value):
                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3     3
oaks   a           NaN     5     4     3     2
Is there a way to instruct pandas not to extrapolate past the last non-missing value in a series?
EDIT:
I'd still love to see this functionality in pandas, but for now I've implemented it as a function in numpy and then I use df.apply() to modify the df. It was the functionality of the left and right parameters in np.interp() that I was missing out on in pandas.
def interpolate(a, dec=None):
    """
    :param a: a 1d array to be interpolated
    :param dec: the number of decimal places with which each
                value should be returned
    :return: returns an array of integers or floats
    """
    # default value is the largest number of decimal places in the input array
    if dec is None:
        dec = max_decimal(a)
    # detect the input format and convert to a numpy float array;
    # t is always defined so the return-type check below cannot fail
    t = 'list' if isinstance(a, list) else 'array'
    b = np.asarray(a, dtype='float')
    # return the row if it's all nan's
    if np.all(np.isnan(b)):
        return a
    # interpolate
    x = np.arange(b.size)
    xp = np.where(~np.isnan(b))[0]
    fp = b[xp]
    interp = np.around(np.interp(x, xp, fp, np.nan, np.nan), decimals=dec)
    # return with proper numerical type formatting
    # check to make sure there aren't nan's before converting to int
    if dec == 0 and not np.isnan(interp).any():
        interp = interp.astype(int)
    if t == 'list':
        return interp.tolist()
    else:
        return interp
# two little helper functions (count_decimal needs the decimal module)
import decimal
def count_decimal(i):
    try:
        return int(decimal.Decimal(str(i)).as_tuple().exponent) * -1
    except ValueError:
        return 0
def max_decimal(a):
    m = 0
    for i in a:
        n = count_decimal(i)
        if n > m:
            m = n
    return m
Works like a charm on the example dataset:
In[1]: df.apply(interpolate, axis=1)
Out[1]:
                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2
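The effect of the left and right parameters of np.interp() mentioned above can be seen in isolation. The following is only a minimal sketch on a made-up 1d row (the array values and variable names are illustrative assumptions, not part of the original dataset):

```python
import numpy as np

# hypothetical row with missing values at both ends and in the middle
row = np.array([np.nan, 5.0, np.nan, np.nan, 2.0, np.nan])

x = np.arange(row.size)               # positions to evaluate
xp = np.flatnonzero(~np.isnan(row))   # positions of the known values
fp = row[xp]                          # the known values themselves

# default behaviour: positions outside the known range take the edge values
clamped = np.interp(x, xp, fp)

# left/right=np.nan: positions outside the known range stay missing
bounded = np.interp(x, xp, fp, left=np.nan, right=np.nan)

print(clamped)
print(bounded)
```

With the defaults, the leading and trailing positions are filled with the nearest known value; passing np.nan for left and right keeps them missing, which is the behaviour the function above relies on.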
As of pandas version 0.21.0, limit_area='inside' tells df.interpolate to only fill NaNs surrounded by valid values:
import pandas as pd  # version 0.21.0
import numpy as np  
df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   'year': list(range(2000, 2005)) * 2,
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()
df2 = df.interpolate(method='linear', axis=1, limit_area='inside')
print(df2)
yields
                value                    
year             2000 2001 2002 2003 2004
trees  location                          
maples b          NaN  1.0  2.0  3.0  NaN
oaks   a          NaN  5.0  4.0  3.0  2.0
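To see the difference from the default more directly, here is a small sketch on a single Series, assuming a pandas version that supports limit_area (the Series values are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical series with a leading, an interior, and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# default: the interior NaN is interpolated and the trailing NaN is
# padded with the last valid value
default = s.interpolate(method='linear')

# limit_area='inside': only NaNs surrounded by valid values are filled
inside = s.interpolate(method='linear', limit_area='inside')

print(default)
print(inside)
```

In both cases the leading NaN is left alone; only the treatment of the trailing NaN differs.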
Replace the following line:
df = df.interpolate(method='linear', axis=1)
with this:
df = df.interpolate(axis=1).where(df.bfill(axis=1).notnull())
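It uses a backfill to build a mask for the trailing NaNs: bfill leaves them as NaN, so notnull() is False only at those positions, and where() resets them to NaN after the interpolation. It isn't especially efficient because it performs two NaN-filling operations, but that is unlikely to matter in typical use.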
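The masking idea can be broken into its individual steps. This is just a sketch on one made-up Series rather than the full dataframe (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical series with a leading, an interior, and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# step 1: plain interpolation, which also pads the trailing NaN
filled = s.interpolate()

# step 2: backfill cannot reach trailing NaNs, so notnull() is
# False exactly at the positions that should stay missing
keep = s.bfill().notnull()

# step 3: keep interpolated values where the mask is True, NaN elsewhere
result = filled.where(keep)

print(keep.tolist())
print(result)
```

The leading NaN survives because the default interpolation never fills it in the first place; the mask is only needed for the trailing end.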