Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mark interpolated NaN points in Pandas plot

When I use interpolation (or fillna, or any other method of generating some fake data) in Pandas, I would like this to show in my plots. Ideally, I would like to use a different marker for these points in the plot. For regular points I want to use filled circles ('o'), for fake data I want to use crosses ('x').

Of course, I would like to do this with a nice Pythonic oneliner.

One further complication is that I want to use the subplots option in the plot function to plot all my columns at once. I'm hoping manipulating the subplots with Matplotlib voodoo is not necessary, though at this point that's the only option I can think of.

The data I'm using is something like the following (put into file 'meterstanden.ssv'):

datum       tijd   gas[m^3]   electra1[kWh] electra2[kWh]  water[m^3]
2015-03-06  09:00  4000.318   10300         9000           300.0
2015-03-24  20:10  4020.220   -             10003          -
2015-08-02  11:15  4120.388   10500         11000          350.5

And here is the script I'm using to process it:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_table("meterstanden.ssv", delim_whitespace=True,
                   parse_dates=[[0, 1]], index_col=0, na_values=['-'])

df.interpolate(method='time').plot(subplots=True, layout=(2, 2),
                                   figsize=(14, 10), marker='o')
plt.show()

I want the - entries in the table to be plotted with cross-markers.

like image 234
egpbos Avatar asked Mar 15 '23 05:03

egpbos


1 Answers

I can't come up with pythonic one-liner, but maybe this will do. (With a scatter plot you'd have some more options, like using the s kwarg, although I'm not sure it's ultimately any better than this solution.)

np.random.seed(123)
df=pd.DataFrame({ 'x':np.random.choice([1,2,np.nan],20),
                  'y':np.random.choice([3,4,np.nan],20) },
                  index=pd.date_range('2015-1-1',periods=20) )

I think that regardless of the exact approach it will be necessary to set up two dataframes which differ with respect to inclusion of the imputed values. I'll do it like this.

mask=df.isnull()
df=df.interpolate(method='time')
imputed=df[mask]

Then it's just a matter of overlaying two line plots. One is done normally but the second is with no lines and only includes the imputed values. You don't care about the connecting lines from the imputed values, but you do want to see the points, so you give them a distinguishing marker. I'm using 'o' instead of 'x' as you requested because the 'o' displays a little more clearly but you can change that, of course.

for c in df.columns:
    plt.plot(df[c])
    plt.plot(imputed[c],linestyle='',marker='o')
    plt.show()

enter image description here

enter image description here

You could also consider using the line color to convey information about what parts of the graph are based on imputed values. There are several ways you might do this, here's one.

not_imp=df[~mask]

for c in df.columns:
    plt.plot(df[c],color='r')
    plt.plot(not_imp[c],color='b',marker='o')
    plt.plot(imputed[c],color='r',marker='d',linestyle='')
    plt.show()

So what is happening here is that the blue (circle) markers indicate real (not imputed) values and blue lines connect real to real. Red (diamond) markers indicate imputed values and red lines connect imputed values to other imputed values or to real values.

enter image description here

enter image description here

like image 155
JohnE Avatar answered Mar 24 '23 05:03

JohnE