Showing data and model predictions in one plot using Seaborn and Statsmodels

Tags:

python

matplotlib

seaborn

statsmodels

Seaborn is a great package for doing some high-level plotting with pretty outputs. However, I'm struggling a little with using Seaborn to overlay both data and model predictions from an externally-fit model. In this example I am fitting models in Statsmodels that are too complex for Seaborn to do out-of-the-box, but I think the problem is more general (i.e. if I have model predictions and want to visualise both them and data using Seaborn).

Let's start with imports and a dataset:

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import patsy
import itertools
import matplotlib.pyplot as plt

np.random.seed(12345)

# make a data frame with one continuous and two categorical variables:
df = pd.DataFrame({'x1': np.random.normal(size=100),
                     'x2': np.tile(np.array(['a', 'b']), 50),
                     'x3': np.repeat(np.array(['c', 'd']), 50)})

# create a design matrix using patsy:
X = patsy.dmatrix('x1 * x2 * x3', df)

# some random beta weights:
betas = np.random.normal(size=X.shape[1])

# create the response variable as the noisy linear combination of predictors:
df['y'] = np.inner(X, betas) + np.random.normal(size=100)

We fit a model in statsmodels containing all predictor variables and their interactions:

# fit a model with all interactions
fit = smf.ols('y ~ x1 * x2 * x3', df).fit()
print(fit.summary())

Since in this case we have all combinations of variables specified, and our model predictions are linear, it would suffice for plotting to add a new "predictions" column to the dataframe containing the model predictions. However, that's not very general (imagine our model is nonlinear and so we want our plots to show smooth curves), so instead I make a new dataframe with all combinations of predictors, then generate predictions:

# create a new dataframe of predictions, using pandas' expand grid:
def expand_grid(data_dict):
    """ A port of R's expand.grid function for use with Pandas dataframes.

    from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid

    """
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())


# build a new matrix with expand grid:

preds = expand_grid(
                {'x1': np.linspace(df['x1'].min(), df['x1'].max(), 2),
                 'x2': ['a', 'b'],
                 'x3': ['c', 'd']})
preds['yhat'] = fit.predict(preds)

The preds dataframe looks like this:

  x3        x1 x2      yhat
0  c -2.370232  a -1.555902
1  c -2.370232  b -2.307295
2  c  3.248944  a -1.555902
3  c  3.248944  b -2.307295
4  d -2.370232  a -1.609652
5  d -2.370232  b -2.837075
6  d  3.248944  a -1.609652
7  d  3.248944  b -2.837075

Since Seaborn plot commands (unlike ggplot2 commands in R) appear to accept one and only one dataframe, we need to merge our predictions into the raw data:

# append to df (DataFrame.append was removed in pandas 2.0, so use pd.concat):
merged = pd.concat([df, preds], ignore_index=True)

We can now plot the model predictions along with the data, with our continuous variable x1 as the x-axis:

# plot using seaborn:
sns.set_style('white')
sns.set_context('talk')
g = sns.FacetGrid(merged, hue='x2', col='x3', height=5)  # `size` was renamed `height` in seaborn 0.9
# use the `map` method to add stuff to the facetgrid axes:
g.map(plt.plot, "x1", "yhat")
g.map(plt.scatter, "x1", "y")
g.add_legend()
g.fig.subplots_adjust(wspace=0.3)
sns.despine(offset=10);

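[Figure: data (scatter) and model predictions (lines) against x1, one panel per level of x3, colored by x2: https://i.stack.imgur.com/ZlxSX.png]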

So far so good. Now imagine that we didn't measure the continuous variable x1, and we only know about the other two (categorical) variables (i.e., we have a 2x2 factorial design). How can we plot the model predictions against data in this case?

fit = smf.ols('y ~ x2 * x3', df).fit()
print(fit.summary())

preds = expand_grid(
                {'x2': ['a', 'b'],
                 'x3': ['c', 'd']})
preds['yhat'] = fit.predict(preds)
print(preds)

# append to df (DataFrame.append was removed in pandas 2.0, so use pd.concat):
merged = pd.concat([df, preds], ignore_index=True)

Well, we can plot the model predictions using sns.pointplot or similar, like so:

# plot using seaborn:
g = sns.FacetGrid(merged, hue='x3', height=4)  # `size` was renamed `height` in seaborn 0.9
g.map(sns.pointplot, 'x2', 'yhat')
g.add_legend();
sns.despine(offset=10);

[Figure: point plot of yhat against x2, colored by x3: https://i.stack.imgur.com/l6ung.png]

Or the data using sns.catplot (called sns.factorplot before seaborn 0.9) like so:

g = sns.catplot(x='x2', y='y', hue='x3', kind='point', data=merged)
sns.despine(offset=10);
g.savefig('tmp3.png')

[Figure: point plot of y against x2, colored by x3: https://i.stack.imgur.com/dOpq1.png]

But I do not see how to produce a plot similar to the first one (i.e. lines for model predictions using plt.plot, a scatter of points for data using plt.scatter). The reason is that the x2 variable I'm trying to use as the x-axis is a string / object, so the pyplot commands don't know what to do with it.

asked Jan 30 '15 15:01

tsawallis

1 Answer

As I mention in my comments, there are two ways I would think about doing this.

The first is to define a function that does the fit and then plots and pass it to FacetGrid.map:

import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")

def plot_good_tip(day, total_bill, **kws):

    expected_tip = (total_bill.groupby(day)
                              .mean()
                              .apply(lambda x: x * .2)
                              .reset_index(name="tip"))
    # pass x and y by keyword; recent seaborn releases no longer accept them positionally
    sns.pointplot(x=expected_tip.day, y=expected_tip.tip,
                  linestyles=["--"], markers=["D"])

g = sns.FacetGrid(tips, col="sex", height=5)  # `size` was renamed `height` in seaborn 0.9
g.map(sns.pointplot, "day", "tip")
g.map(plot_good_tip, "day", "total_bill")
g.set_axis_labels("day", "tip")

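[Figure: point plots of tip by day, one panel per sex, with the expected tip overlaid as a dashed line with diamond markers: https://i.stack.imgur.com/2vm7o.png]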
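For the 2x2 design in the question, a function of this kind might look like the sketch below. It is only an illustration, not part of the original answer: `plot_data_and_means` and its `order` and `jitter` parameters are hypothetical names, and the per-category means stand in for model predictions (that substitution is exact only for a saturated categorical OLS model). The string levels are mapped to integer x positions so that plain matplotlib calls can draw them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_data_and_means(x, y, order=None, jitter=0.05, **kws):
    """Scatter raw data and draw a line through the per-category means.

    Hypothetical helper: the means are a stand-in for model predictions.
    """
    x = pd.Series(np.asarray(x))
    y = pd.Series(np.asarray(y, dtype=float))
    levels = sorted(x.unique()) if order is None else list(order)
    # Map each string level to an integer position on the x-axis.
    positions = x.map({level: i for i, level in enumerate(levels)})
    means = y.groupby(positions).mean()

    # FacetGrid.map passes `color` and `label`; attach the label to the line only.
    label = kws.pop("label", None)
    noise = np.random.default_rng(0).uniform(-jitter, jitter, len(x))
    ax = plt.gca()
    ax.scatter(positions + noise, y, alpha=0.5, **kws)
    ax.plot(means.index, means.values, marker="D", linestyle="--",
            label=label, **kws)
    ax.set_xticks(range(len(levels)))
    ax.set_xticklabels(levels)
    return means


# Small synthetic demo, called directly on a matplotlib axes.
rng = np.random.default_rng(12345)
demo = pd.DataFrame({"x2": np.tile(["a", "b"], 50),
                     "y": rng.normal(size=100)})
fig, ax = plt.subplots()
means = plot_data_and_means(demo["x2"], demo["y"])
```

With a FacetGrid, the same function could presumably be used as `g.map(plot_data_and_means, "x2", "y", order=["a", "b"])`. When `hue` is set, each hue level triggers a separate call, so passing an explicit `order` keeps the x positions consistent across calls.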

The second is to compute the predicted values and then merge them into your DataFrame with an additional variable that identifies what is data and what is model:

tip_predict = (tips.groupby(["day", "sex"])
                   .total_bill
                   .mean()
                   .apply(lambda x: x * .2)
                   .reset_index(name="tip"))
tip_all = pd.concat(dict(data=tips[["day", "sex", "tip"]], model=tip_predict),
                    names=["kind"]).reset_index()

# sns.factorplot was renamed sns.catplot in seaborn 0.9
sns.catplot(x="day", y="tip", hue="kind", data=tip_all, col="sex",
            kind="point", linestyles=["-", "--"], markers=["o", "D"])

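[Figure: point plots of tip by day, one panel per sex, with data (solid line, circles) and model (dashed line, diamonds) distinguished by kind: https://i.stack.imgur.com/1sTqM.png]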
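Applied to the 2x2 design from the question, the same merge might be sketched as follows. This is an illustration under stated assumptions rather than the answer's own code: the data are synthetic, and the cell means are used as a stand-in for `fit.predict(preds)`, which matches the fitted values only for a saturated `y ~ x2 * x3` OLS model. The response column keeps the name `y` in both frames so that data and predictions share one plotting variable.

```python
import numpy as np
import pandas as pd

# Synthetic 2x2 data shaped like the question's df (assumption: no x1 column).
rng = np.random.default_rng(12345)
df = pd.DataFrame({"x2": np.tile(["a", "b"], 50),
                   "x3": np.repeat(["c", "d"], 50),
                   "y": rng.normal(size=100)})

# Stand-in for fit.predict(preds): for a saturated model the fitted
# values equal the cell means.
preds = df.groupby(["x2", "x3"], as_index=False)["y"].mean()

# Stack data and predictions, labelling each row's origin in a "kind" column.
tidy = pd.concat({"data": df[["x2", "x3", "y"]], "model": preds},
                 names=["kind"]).reset_index(level="kind")
tidy = tidy.reset_index(drop=True)
```

The resulting `tidy` frame can then be handed to the point-plot call shown in the answer, with `hue="kind"` and, for example, `col="x3"`.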

answered Oct 22 '22 17:10

mwaskom
