
Python/Scikit-learn/regressions - from pandas Dataframes to Scikit prediction

I have the following pandas DataFrame, called main_frame:

            target_var  input1  input2  input3  input4  input5    input6
Date
2013-09-01        13.0     NaN     NaN     NaN     NaN     NaN       NaN   
2013-10-01        13.0     NaN     NaN     NaN     NaN     NaN       NaN   
2013-11-01        12.2     NaN     NaN     NaN     NaN     NaN       NaN   
2013-12-01        10.9     NaN     NaN     NaN     NaN     NaN       NaN   
2014-01-01        11.7       0      13      42       0       0        16   
2014-02-01        12.0      13       8      58       0       0        14   
2014-03-01        12.8      13      15     100       0       0        24   
2014-04-01        13.1       0      11      50      34       0        18   
2014-05-01        12.2      12      14      56      30      71        18   
2014-06-01        11.7      13      16      43      44       0        22   
2014-07-01        11.2       0      19      45      35       0        18   
2014-08-01        11.4      12      16      37      31       0        24   
2014-09-01        10.9      14      14      47      30      56        20   
2014-10-01        10.5      15      17      54      24      56        22   
2014-11-01        10.7      12      18      60      41      63        21   
2014-12-01         9.6      12      14      42      29      53        16   
2015-01-01        10.2      10      16      37      31       0        20   
2015-02-01        10.7      11      20      39      28       0        19   
2015-03-01        10.9      10      17      75      27      87        22   
2015-04-01        10.8      14      17      73      30      43        25   
2015-05-01        10.2      10      17      55      31      52        24

I've been having trouble exploring this dataset with scikit-learn, and I'm not sure whether the problem is the pandas DataFrame itself, the dates used as the index, the NaNs/Infs/zeros (which I don't know how to handle), all of these, or something else I haven't been able to track down.

I want to build a simple regression that predicts the next target_var value from the input variables (input1 through input6).

Note that there are a lot of zeros and NaNs in the time series, and eventually we might find Infs as well.

asked Dec 27 '15 at 21:12 by aabujamra



1 Answer

You should first remove any row that contains an Inf, -Inf or NaN value, because LinearRegression does not accept missing or infinite inputs (an alternative is to fill in the NaNs with, for example, the mean value of the feature).

import numpy as np
# Treat +/-Inf as missing, then drop every row that has a missing value
df = main_frame.replace(to_replace=[np.inf, -np.inf], value=np.nan)
df = df.dropna()
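The mean-fill alternative mentioned above could look roughly like the sketch below. The `toy` frame is a hypothetical stand-in for main_frame with made-up values. Whether imputation is sensible for rows where every input is missing (as in the first four rows of the question's data) is a judgment call, since the filled values carry no real information.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for main_frame (illustrative values only).
toy = pd.DataFrame({
    "target_var": [13.0, 11.7, 12.0, 12.8],
    "input1": [np.nan, 0.0, 13.0, np.inf],
    "input2": [np.nan, 13.0, 8.0, 15.0],
})

# Treat +/-Inf as missing so they do not distort the column means.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Fill each column's gaps with that column's mean (NaNs are skipped by mean()).
filled = cleaned.fillna(cleaned.mean())
```

Unlike dropna(), this keeps every row, at the cost of inventing feature values for the periods that had none.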

Now, create a numpy matrix of your features and a vector of your targets. Given that your target variable is in the first column, you can use integer-based indexing as follows:

X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
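If the column order might change, selecting by name is a possible alternative to integer positions. The sketch below uses a hypothetical two-row frame with the same naming scheme as the question, and it also shows that the date index is not part of the feature matrix, so it does not get in scikit-learn's way.

```python
import pandas as pd

# Hypothetical miniature of the question's frame, with dates as the index.
toy = pd.DataFrame(
    {"target_var": [11.7, 12.0], "input1": [0, 13], "input2": [13, 8]},
    index=pd.to_datetime(["2014-01-01", "2014-02-01"]),
)

# Pick feature columns by their "input" prefix rather than by position.
feature_cols = [c for c in toy.columns if c.startswith("input")]

X = toy[feature_cols].to_numpy()   # 2-D array of features; the index is left out
y = toy["target_var"].to_numpy()   # 1-D array of targets
```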

Then create and fit your model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X=X, y=y)

Now you can inspect the fitted intercept and coefficients:

>>> model.intercept_
12.109583092421092

>>> model.coef_
array([-0.05269033, -0.17723251,  0.03627883,  0.02219596, -0.01377465,
        0.0111017 ])
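To get an actual prediction from a fitted model, you would pass a row of input values to predict(). The sketch below uses synthetic data (not the question's) generated from a known linear rule so the result is easy to check; `new_inputs` is a hypothetical row of feature values for the period you want to predict. Note that this kind of model needs the inputs for that period to be known; forecasting from past values alone would require lagged features, which is not covered here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data following y = 1 + 2*x1 + 3*x2 exactly.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

# Hypothetical inputs for a new period; predict() expects a 2-D array
# with one row per sample and the same column order used for fitting.
new_inputs = np.array([[3.0, 2.0]])
prediction = model.predict(new_inputs)   # array with one predicted target value
```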
answered Oct 18 '22 at 23:10 by Alexander