Weighted linear regression with Scikit-learn

My data:

State           N           Var1            Var2
Alabama         23          54              42
Alaska          4           53              53
Arizona         53          75              65

Var1 and Var2 are aggregated percentage values at the state level, and N is the number of participants in each state. I would like to run a linear regression between Var1 and Var2, using N as a weight, with sklearn in Python 2.7.

The general form of the call is:

fit(X, y[, sample_weight])

Say the data is loaded into df using pandas and N becomes df["N"]. Do I simply fit the data with the following line, or do I need to process N somehow before using it as sample_weight?

fit(df["Var1"], df["Var2"], sample_weight=df["N"])
asked Feb 06 '16 by KubiK888

People also ask

Which scikit-learn class is used for linear regression?

sklearn.linear_model.LinearRegression: ordinary least squares linear regression. LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
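
For reference, a minimal usage sketch (the toy numbers here are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2D (n_samples, n_features); y is 1D
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept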

What is weighted regression analysis?

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.
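
To make the definition concrete, here is a sketch of the WLS closed-form solution via the weighted normal equations, b = (X^T W X)^(-1) X^T W y with W = diag(w) (the toy data below is made up):

import numpy as np

# A column of ones provides the intercept term
X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
w = np.array([1.0, 1.0, 5.0, 5.0])  # e.g., participant counts per observation

# Solve the weighted normal equations: (X^T W X) b = X^T W y
W = np.diag(w)
b = np.linalg.solve(X.T.dot(W).dot(X), X.T.dot(W).dot(y))
print(b)  # [intercept, slope]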


1 Answer

The weights enable training a model that is more accurate for certain values of the input (e.g., where the cost of error is higher). Internally, the weights w are multiplied by the squared residuals in the loss function [1]:

loss = Σ_i w_i (y_i − ŷ_i)²

Therefore, it is the relative scale of the weights that matters: N can be passed as is if it already reflects the priorities, and uniformly scaling all the weights does not change the outcome.

Here is an example. In the weighted version, we emphasize the region around the last two samples, and the model becomes more accurate there. Scaling the weights does not affect the outcome, as expected.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True)
n_samples = 20

# Use only one feature and sort
X = X[:, np.newaxis, 2][:n_samples]
y = y[:n_samples]
p = X[:, 0].argsort()
X = X[p]
y = y[p]

# Create equal weights, then increase the weight of the last two samples
sample_weight = np.ones(n_samples) * 20
sample_weight[-2:] *= 30

plt.scatter(X, y, s=sample_weight, c='grey', edgecolor='black')

# The unweighted model
regr = LinearRegression()
regr.fit(X, y)
plt.plot(X, regr.predict(X), color='blue', linewidth=3, label='Unweighted model')

# The weighted model
regr = LinearRegression()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='red', linewidth=3, label='Weighted model')

# The weighted model - scaled weights
regr = LinearRegression()
sample_weight = sample_weight / sample_weight.max()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='yellow', linewidth=2, label='Weighted model - scaled', linestyle='dashed')
plt.xticks(())
plt.yticks(())
plt.legend()
plt.show()

(Plot: the scatter with point sizes proportional to the weights; the blue unweighted line, the red weighted line pulled toward the heavily weighted last two samples, and the dashed yellow scaled-weights line coinciding with the red one.)
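
As a quick numerical check of the scaling claim, a small sketch reusing X, y, and n_samples from the code above:

# Refit with the original and with uniformly scaled weights, then compare
w = np.ones(n_samples) * 20
w[-2:] *= 30
a = LinearRegression().fit(X, y, sample_weight=w)
b = LinearRegression().fit(X, y, sample_weight=w / w.max())
print(np.allclose(a.coef_, b.coef_), np.allclose(a.intercept_, b.intercept_))  # True True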

(A similar transformation is also needed for the features when calling fit on the question's data: sklearn expects a 2D X, so df["Var1"] has to be reshaped into a column, e.g. by passing df[["Var1"]].)
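
Putting it together for the question's data, a minimal sketch (assuming the table is loaded into a pandas DataFrame df with the columns shown in the question):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"State": ["Alabama", "Alaska", "Arizona"],
                   "N": [23, 4, 53],
                   "Var1": [54, 53, 75],
                   "Var2": [42, 53, 65]})

# df[["Var1"]] keeps X 2D; N can be passed as sample_weight as is
regr = LinearRegression()
regr.fit(df[["Var1"]], df["Var2"], sample_weight=df["N"])
print(regr.coef_, regr.intercept_)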

answered Sep 19 '22 by Reveille