I am using a Q-Q Plot to test if the residuals of my linear regression follow a normal distribution but the result is a vertical line.
It looks like linear regression is a pretty good model for this dataset, so shouldn't the residuals be normally distributed?
The points were created randomly:
import numpy as np
x_values = np.linspace(0, 5, 100)[:, np.newaxis]
y_values = 29 * x_values + 30 * np.random.rand(100, 1)
Then, I fitted a Linear Regression model:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_values, y_values)
predictions = reg.predict(x_values)
residuals = y_values - predictions
Finally, I used the statsmodels module to plot the Q-Q plot of the residuals:
import statsmodels.api as sm
fig = sm.qqplot(residuals, line='45')
The whole idea of a Q-Q plot is to compare the quantiles of your residuals against those of a true normal distribution.
Hence, if the quantiles of the theoretical distribution (which is in fact normal) match those of your residuals (i.e., the points fall on a straight line when plotted against each other), then you can conclude that your residuals are approximately normally distributed, which supports the assumptions behind your model.
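As a rough sketch of what the plot compares under the hood (the variable names here are illustrative, not from statsmodels): sort the residuals to get empirical quantiles, compute the matching theoretical normal quantiles, and plot one against the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)

sample_quantiles = np.sort(residuals)            # empirical quantiles
probs = (np.arange(1, 101) - 0.5) / 100          # plotting positions in (0, 1)
theoretical_quantiles = stats.norm.ppf(probs)    # matching normal quantiles

# For normally distributed residuals the two sets of quantiles are nearly
# proportional, so their correlation is close to 1.
r = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(round(r, 3))
```

If the residuals came from a uniform distribution instead, the points would bend away from the line at both tails.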
shouldn't the residuals be normally distributed?
If you look at your code, you'll see that the random terms in y_values are NOT normally distributed: you generated them with numpy.random.rand, which draws from a uniform distribution, not a normal one.
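A quick way to see the difference (a sketch using numpy's Generator API, not part of your original code):

```python
import numpy as np

rng = np.random.default_rng(0)

uniform_draws = rng.random(10_000)       # like np.random.rand: uniform on [0, 1)
normal_draws = rng.normal(size=10_000)   # what a normal error term looks like

# Uniform draws never leave [0, 1); normal draws spread symmetrically around 0.
print(uniform_draws.min() >= 0.0, uniform_draws.max() < 1.0)
print(normal_draws.min() < 0.0 < normal_draws.max())
```

So the error term 30 * np.random.rand(100, 1) in your code is bounded on [0, 30) rather than bell-shaped.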
Regardless, if your Q-Q plot looks like a straight line, that means the quantiles of your residuals match those of a normal distribution.
I think you were expecting to see the distribution of your errors. That is, you're looking for a histogram, not a Q-Q plot!
# Import pyplot
from matplotlib import pyplot as plt
# Plot histogram of residuals
plt.hist(residuals, bins=10)
plt.show()
In my case, the residuals do not look normally distributed, but the result will differ on your end because you did not set a random seed when generating y_values.
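If you want a quantitative check to go with the histogram, a Shapiro-Wilk test from scipy (my addition, not part of the original snippet) rejects normality for uniform errors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
uniform_resid = rng.uniform(low=-1, high=1, size=500)

# Shapiro-Wilk tests the null hypothesis that the data are normal;
# a tiny p-value means normality is rejected.
stat, p_value = stats.shapiro(uniform_resid)
print(p_value < 0.05)
```

For genuinely normal residuals the p-value would typically stay well above 0.05.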
Your problem is two-fold here:

1. The primary problem is that sklearn (scikit-learn) expects your input to be a 2d columnar array, whereas qqplot from statsmodels expects your data to be a true 1d array. When you pass the columnar residuals to qqplot, it attempts to transform each residual individually instead of the dataset as a whole.
2. numpy.random.rand draws from a uniform distribution, so your errors aren't normal to begin with!
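To make the shape issue concrete, here is a tiny illustration (the array is made up purely for demonstration):

```python
import numpy as np

residuals = np.zeros((100, 1))     # a 2d columnar array, as sklearn produces
print(residuals.shape)             # (100, 1)

flat = residuals[:, 0]             # select the column -> a true 1d array
print(flat.shape)                  # (100,)

# residuals.ravel() and residuals.squeeze() give the same 1d result
```

Passing the 1d version is what lets qqplot treat the residuals as a single dataset.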
To highlight this, I've adapted your code sample. The top row in the resultant figure comprises predictions & residuals for a uniform residual distribution, whereas the bottom row uses a normal distribution for the errors. The difference between the "qq_bad" and "qq_good" plots comes down to selecting the column of data and passing it in as a true 1d array (instead of a 1d columnar array).
from matplotlib.pyplot import subplot_mosaic, show, rc
from matplotlib.lines import Line2D
from matplotlib.transforms import blended_transform_factory
from numpy.random import default_rng
from numpy import linspace
from sklearn.linear_model import LinearRegression
from statsmodels.api import qqplot
from scipy.stats import zscore
rc('font', size=14)
rc('axes.spines', top=False, right=False)
rng = default_rng(0)
size = 100
x_values = linspace(0, 5, size)[:, None]
errors = {
    'uniform': rng.uniform(low=-50, high=50, size=(size, 1)),
    'normal': rng.normal(loc=0, scale=15, size=(size, 1))
}

fig, axd = subplot_mosaic([
    ['uniform_fit', 'uniform_hist', 'uniform_qq_bad', 'uniform_qq_good'],
    ['normal_fit', 'normal_hist', 'normal_qq_bad', 'normal_qq_good']
], figsize=(12, 6), gridspec_kw={'wspace': .4, 'hspace': .2})
for err_type, err in errors.items():
    reg = LinearRegression()

    y_values = 29 * x_values + 30 + err
    fit = reg.fit(x_values, y_values)
    predictions = fit.predict(x_values)
    residuals = predictions - y_values

    axd[f'{err_type}_fit'].scatter(x_values, y_values, s=10, alpha=.8)
    axd[f'{err_type}_fit'].plot(x_values, predictions)
    axd[f'{err_type}_hist'].hist(residuals, bins=20)
    qqplot(residuals, ax=axd[f'{err_type}_qq_bad'], line='q')
    qqplot(residuals[:, 0], ax=axd[f'{err_type}_qq_good'], line='q')
####
# Below is primarily for plot aesthetics, feel free to ignore
for label, ax in axd.items():
    ax.set_ylabel(None)
    ax.set_xlabel(None)
    if label.startswith('uniform'):
        ax.set_title(label.replace('uniform_', '').replace('_', ' '))
    if label.endswith('fit'):
        ax.set_ylabel(f'{label.replace("_fit", "")} error')
        line = Line2D(
            [.05, .95], [1.04, 1.04],
            color='black',
            transform=blended_transform_factory(fig.transFigure, ax.transAxes),
        )
        fig.add_artist(line)
show()