efficiently passing dataframes as y and X to scikit-learn fits

I generate a pandas dataframe from read_sql_query. It has three columns: "result", "speed", and "weight".

I want to use scikit-learn LinearRegression to fit result = f(speed, weight)

I haven't been able to find the correct syntax that would allow me to pass this dataframe, or column slices of it, to LinearRegression.fit(y, X).

print df['result'].shape
print df[['speed', 'weight']].shape
(8L,)
(8, 2)

but I cannot pass that to fit

lm.fit(df['result'], df[['speed', 'weight']])

It throws a deprecation warning and a ValueError

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. 
ValueError: Found arrays with inconsistent numbers of samples: [1 8]

What is the efficient, clean way to take dataframes of targets and features, and pass them to fit operations?

This is how I generated the example:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(1, high=100, size=len(days))
data3 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'result': data,'speed': data2,'weight': data3})
df = df.set_index('test')
print(df)
user3556757 asked Dec 13 '22 19:12



2 Answers

You are passing the arguments in the wrong order. The fit() method of scikit-learn estimators takes X, y (features first, then target), not y, X as you are doing.

Try this:

lm.fit(df[['speed', 'weight']], df['result'])
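
For reference, here is a minimal, self-contained sketch of the corrected call. The data values are made up for illustration (only the column names come from the question), and result is constructed as an exact linear function of the two features so the fitted coefficients are easy to check:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data only; the column names follow the question.
df = pd.DataFrame({
    'speed':  [10, 20, 30, 40, 50, 60, 70, 80],
    'weight': [5, 3, 8, 1, 9, 2, 7, 4],
})
# Build the target as an exact linear function of the features.
df['result'] = 2.0 * df['speed'] + 3.0 * df['weight'] + 5.0

X = df[['speed', 'weight']]   # double brackets: DataFrame, shape (8, 2)
y = df['result']              # single brackets: Series, shape (8,)

lm = LinearRegression()
lm.fit(X, y)                  # features first, target second

print(lm.coef_, lm.intercept_)
```

Because the target was built from known coefficients, the printed values should come out close to 2, 3 and 5, which is a quick way to confirm the arguments went in the right way round.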
Vivek Kumar answered Dec 27 '22 03:12



First of all, fit() takes X, y and not y, X.

Second, it's important to remember that scikit-learn works with array-like objects. For a single target, it expects X to have shape (n_samples, n_features) and y to have shape (n_samples,).

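It checks these shapes when you call fit, so if your X and y don't follow these rules, it may raise an error or a warning. In the example below, X already has shape (5, 2), but y, built as a one-column DataFrame, has shape (5, 1), which is different from (5,), so your program might fail or warn, depending on the estimator. (In the question, df['result'] is a Series with shape (8,), which is already the expected shape.)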

To be safe, I'd simply convert X and y to numpy arrays from the start.

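import numpy as np
import pandas as pd

X = pd.DataFrame(np.ones((5, 2)))  # shape (5, 2)
y = pd.DataFrame(np.ones((5,)))    # one-column DataFrame, shape (5, 1)

X = np.array(X)                    # still shape (5, 2)
y = np.array(y).squeeze()          # shape (5, 1) becomes (5,)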

For y to go from shape (5, 1) to shape (5,), you need to use .squeeze(). This will give you the right shapes and hopefully the program will run!
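
To make the shape difference concrete, here is a small sketch (the variable names and values are illustrative, not from the question) comparing a Series, a one-column DataFrame, and the squeezed array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'result': [1.0, 2.0, 3.0, 4.0, 5.0]})

y_series = df['result']       # single brackets: Series
y_frame = df[['result']]      # double brackets: one-column DataFrame

print(y_series.shape)         # (5,)
print(y_frame.shape)          # (5, 1)

# Converting the one-column DataFrame and squeezing drops the extra axis.
y_flat = np.array(y_frame).squeeze()
print(y_flat.shape)           # (5,)
```

So a target selected with single brackets, like df['result'] in the question, needs no squeeze; it only matters when y is a one-column DataFrame or a 2-D array.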

Valentin Calomme answered Dec 27 '22 01:12
