Weighted linear regression with Scikit-learn

My data:

State           N           Var1            Var2
Alabama         23          54              42
Alaska          4           53              53
Arizona         53          75              65

Var1 and Var2 are aggregated percentage values at the state level, and N is the number of participants in each state. I would like to run a linear regression between Var1 and Var2, using N as a weight, with sklearn in Python 2.7.

The general form of the call is:

fit(X, y[, sample_weight])

Say the data is loaded into df using pandas and N becomes df["N"]. Do I simply fit the data with the following line, or do I need to process N somehow before using it as sample_weight?

fit(df["Var1"], df["Var2"], sample_weight=df["N"])
asked Feb 06 '16 by KubiK888

People also ask

Which scikit-learn class is used for linear regression?

sklearn.linear_model.LinearRegression: ordinary least squares linear regression. LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
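
For reference, a minimal usage sketch (the toy numbers here are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2D (n_samples, n_features); y is 1D
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept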

What is weighted regression analysis?

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.
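
To make the definition concrete, here is a sketch of the WLS closed-form solution via the weighted normal equations, b = (X^T W X)^(-1) X^T W y with W = diag(w) (the toy data below is made up):

import numpy as np

# A column of ones provides the intercept term
X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
w = np.array([1.0, 1.0, 5.0, 5.0])  # e.g., participant counts per observation

# Solve the weighted normal equations: (X^T W X) b = X^T W y
W = np.diag(w)
b = np.linalg.solve(X.T.dot(W).dot(X), X.T.dot(W).dot(y))
print(b)  # [intercept, slope]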


1 Answer

The weights enable training a model that is more accurate for certain values of the input (e.g., where the cost of error is higher). Internally, the weights w are multiplied by the squared residuals in the loss function [1]:

loss = Σ_i w_i (y_i − ŷ_i)²

Therefore, it is the relative scale of the weights that matters: N can be passed as is if it already reflects the priorities, and uniformly scaling all the weights does not change the outcome.

Here is an example. In the weighted version, we emphasize the region around the last two samples, and the model becomes more accurate there. Scaling the weights does not affect the outcome, as expected.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True)
n_samples = 20

# Use only one feature and sort
X = X[:, np.newaxis, 2][:n_samples]
y = y[:n_samples]
p = X[:, 0].argsort()
X = X[p]
y = y[p]

# Create equal weights, then increase the weight of the last two samples
sample_weight = np.ones(n_samples) * 20
sample_weight[-2:] *= 30

plt.scatter(X, y, s=sample_weight, c='grey', edgecolor='black')

# The unweighted model
regr = LinearRegression()
regr.fit(X, y)
plt.plot(X, regr.predict(X), color='blue', linewidth=3, label='Unweighted model')

# The weighted model
regr = LinearRegression()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='red', linewidth=3, label='Weighted model')

# The weighted model - scaled weights
regr = LinearRegression()
sample_weight = sample_weight / sample_weight.max()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='yellow', linewidth=2, label='Weighted model - scaled', linestyle='dashed')
plt.xticks(())
plt.yticks(())
plt.legend()
plt.show()

(Plot: the scatter with point sizes proportional to the weights; the blue unweighted line, the red weighted line pulled toward the heavily weighted last two samples, and the dashed yellow scaled-weights line coinciding with the red one.)
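
As a quick numerical check of the scaling claim, a small sketch reusing X, y, and n_samples from the code above:

# Refit with the original and with uniformly scaled weights, then compare
w = np.ones(n_samples) * 20
w[-2:] *= 30
a = LinearRegression().fit(X, y, sample_weight=w)
b = LinearRegression().fit(X, y, sample_weight=w / w.max())
print(np.allclose(a.coef_, b.coef_), np.allclose(a.intercept_, b.intercept_))  # True True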

(A similar transformation is also needed for the features when calling fit on the question's data: sklearn expects a 2D X, so df["Var1"] has to be reshaped into a column, e.g. by passing df[["Var1"]].)
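
Putting it together for the question's data, a minimal sketch (assuming the table is loaded into a pandas DataFrame df with the columns shown in the question):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"State": ["Alabama", "Alaska", "Arizona"],
                   "N": [23, 4, 53],
                   "Var1": [54, 53, 75],
                   "Var2": [42, 53, 65]})

# df[["Var1"]] keeps X 2D; N can be passed as sample_weight as is
regr = LinearRegression()
regr.fit(df[["Var1"]], df["Var2"], sample_weight=df["N"])
print(regr.coef_, regr.intercept_)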

answered Sep 19 '22 by Reveille