I am using a Q-Q Plot to test if the residuals of my linear regression follow a normal distribution but the result is a vertical line.
It looks like linear regression is a pretty good model for this dataset, so shouldn't the residuals be normally distributed?
The points were created randomly:
import numpy as np
x_values = np.linspace(0, 5, 100)[:, np.newaxis]
y_values = 29 * x_values + 30 * np.random.rand(100, 1)
Then, I fitted a Linear Regression model:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_values, y_values)
predictions = reg.predict(x_values)
residuals = y_values - predictions
Finally, I used the statsmodels module to plot the Q-Q plot of the residuals:
import statsmodels.api as sm
fig = sm.qqplot(residuals, line='45')
The whole idea of a Q-Q plot is to compare the quantiles of your residuals against those of a true normal distribution.
Hence, if the quantiles of the theoretical distribution (which is in fact normal) match those of your residuals (i.e., the points fall on a straight line when plotted against each other), then you can conclude that your residuals are approximately normally distributed, which supports the assumptions behind your model.
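As a rough sketch of what the plot compares under the hood (the variable names here are illustrative, not from statsmodels): sort the residuals to get empirical quantiles, compute the matching theoretical normal quantiles, and plot one against the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)

sample_quantiles = np.sort(residuals)            # empirical quantiles
probs = (np.arange(1, 101) - 0.5) / 100          # plotting positions in (0, 1)
theoretical_quantiles = stats.norm.ppf(probs)    # matching normal quantiles

# For normally distributed residuals the two sets of quantiles are nearly
# proportional, so their correlation is close to 1.
r = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(round(r, 3))
```

If the residuals came from a uniform distribution instead, the points would bend away from the line at both tails.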
shouldn't the residuals be normally distributed?
If you look at your code, you'll see that the random terms in y_values are NOT normally distributed: you generated them with numpy.random.rand, which draws from a uniform distribution, not a normal one.
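A quick way to see the difference (a sketch using numpy's Generator API, not part of your original code):

```python
import numpy as np

rng = np.random.default_rng(0)

uniform_draws = rng.random(10_000)       # like np.random.rand: uniform on [0, 1)
normal_draws = rng.normal(size=10_000)   # what a normal error term looks like

# Uniform draws never leave [0, 1); normal draws spread symmetrically around 0.
print(uniform_draws.min() >= 0.0, uniform_draws.max() < 1.0)
print(normal_draws.min() < 0.0 < normal_draws.max())
```

So the error term 30 * np.random.rand(100, 1) in your code is bounded on [0, 30) rather than bell-shaped.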
Regardless, if your Q-Q plot looks like a straight line, that means the quantiles of your residuals match those of a normal distribution.
I think you were expecting to see the distribution of your errors. That is, you're looking for a histogram, not a Q-Q plot!
# Import pyplot
from matplotlib import pyplot as plt
# Plot histogram of residuals
plt.hist(residuals, bins=10)
plt.show()
In my case, the residuals do not look normally distributed, but the result will differ on your end because you did not set a random seed when generating y_values.
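If you want a quantitative check to go with the histogram, a Shapiro-Wilk test from scipy (my addition, not part of the original snippet) rejects normality for uniform errors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
uniform_resid = rng.uniform(low=-1, high=1, size=500)

# Shapiro-Wilk tests the null hypothesis that the data are normal;
# a tiny p-value means normality is rejected.
stat, p_value = stats.shapiro(uniform_resid)
print(p_value < 0.05)
```

For genuinely normal residuals the p-value would typically stay well above 0.05.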
Your problem is two-fold here:

1. The primary problem is that sklearn (scikit-learn) expects your input to be a 2d columnar array, whereas qqplot from statsmodels expects your data to be a true 1d array. When you pass the columnar residuals to qqplot, it attempts to transform each residual individually instead of the dataset as a whole.
2. numpy.random.rand draws from a uniform distribution, so your errors aren't normal to begin with!
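To make the shape issue concrete, here is a tiny illustration (the array is made up purely for demonstration):

```python
import numpy as np

residuals = np.zeros((100, 1))     # a 2d columnar array, as sklearn produces
print(residuals.shape)             # (100, 1)

flat = residuals[:, 0]             # select the column -> a true 1d array
print(flat.shape)                  # (100,)

# residuals.ravel() and residuals.squeeze() give the same 1d result
```

Passing the 1d version is what lets qqplot treat the residuals as a single dataset.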
To highlight this, I've adapted your code sample. The top row in the resultant figure comprises predictions & residuals for a uniform residual distribution, whereas the bottom row uses a normal distribution for the errors. The difference between the "qq_bad" and "qq_good" plots comes down to selecting the column of data and passing it in as a true 1d array (instead of a 1d columnar array).
from matplotlib.pyplot import subplot_mosaic, show, rc
from matplotlib.lines import Line2D
from matplotlib.transforms import blended_transform_factory
from numpy.random import default_rng
from numpy import linspace
from sklearn.linear_model import LinearRegression
from statsmodels.api import qqplot
from scipy.stats import zscore
rc('font', size=14)
rc('axes.spines', top=False, right=False)
rng = default_rng(0)
size = 100
x_values = linspace(0, 5, size)[:, None]
errors = {
    'uniform': rng.uniform(low=-50, high=50, size=(size, 1)),
    'normal': rng.normal(loc=0, scale=15, size=(size, 1))
}

fig, axd = subplot_mosaic([
    ['uniform_fit', 'uniform_hist', 'uniform_qq_bad', 'uniform_qq_good'],
    ['normal_fit', 'normal_hist', 'normal_qq_bad', 'normal_qq_good']
], figsize=(12, 6), gridspec_kw={'wspace': .4, 'hspace': .2})
for err_type, err in errors.items():
    reg = LinearRegression()

    y_values = 29 * x_values + 30 + err
    fit = reg.fit(x_values, y_values)
    predictions = fit.predict(x_values)
    residuals = predictions - y_values

    axd[f'{err_type}_fit'].scatter(x_values, y_values, s=10, alpha=.8)
    axd[f'{err_type}_fit'].plot(x_values, predictions)
    axd[f'{err_type}_hist'].hist(residuals, bins=20)
    qqplot(residuals, ax=axd[f'{err_type}_qq_bad'], line='q')
    qqplot(residuals[:, 0], ax=axd[f'{err_type}_qq_good'], line='q')
####
# Below is primarily for plot aesthetics, feel free to ignore
for label, ax in axd.items():
    ax.set_ylabel(None)
    ax.set_xlabel(None)
    if label.startswith('uniform'):
        ax.set_title(label.replace('uniform_', '').replace('_', ' '))
    if label.endswith('fit'):
        ax.set_ylabel(f'{label.replace("_fit", "")} error')
        line = Line2D(
            [.05, .95], [1.04, 1.04],
            color='black',
            transform=blended_transform_factory(fig.transFigure, ax.transAxes),
        )
        fig.add_artist(line)
show()