I have a Pandas DataFrame with a date
column (eg: 2013-04-01
) of dtype datetime.date
. When I include that column in X_train
and try to fit the regression model, I get the error float() argument must be a string or a number
. Removing the date
column avoided this error.
What is the proper way to take the date
into account in the regression model?
Code
data = sql.read_frame(...) X_train = data.drop('y', axis=1) y_train = data.y rf = RandomForestRegressor().fit(X_train, y_train)
Error
TypeError Traceback (most recent call last) <ipython-input-35-8bf6fc450402> in <module>() ----> 2 rf = RandomForestRegressor().fit(X_train, y_train) C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight) 292 X.ndim != 2 or 293 not X.flags.fortran): --> 294 X = array2d(X, dtype=DTYPE, order="F") 295 296 n_samples, self.n_features_ = X.shape C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy) 78 raise TypeError('A sparse matrix was passed, but dense data ' 79 'is required. Use X.toarray() to convert to dense.') ---> 80 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order) 81 _assert_all_finite(X_2d) 82 if X is X_2d and copy: C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order) 318 319 """ --> 320 return array(a, dtype, copy=False, order=order) 321 322 def asanyarray(a, dtype=None, order=None): TypeError: float() argument must be a string or a number
As far as a technical answer to the question, yes you could use dates on the x-axis, or if the specific software you are using does not allow it, set the first day to day 0 and then convert the dates to number of days since the first day.
You can convert the date to an ordinal i.e. an integer representing the number of days since year 1 day 1. You can do this by a datetime. date 's toordinal function. Alternatively, you can turn the dates into categorical variables using sklearn's OneHotEncoder.
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Whether to calculate the intercept for this model.
Photo Credit: Scikit-Learn. Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
The best way is to explode the date into a set of categorical features encoded in boolean form using the 1-of-K encoding (e.g. as done by DictVectorizer). Here are some features that can be extracted from a date:
That should make it possible to identify linear dependencies on periodic events on typical human life cycles.
Additionally you can also extract the date a single float: convert each date as the number of days since the min date of your training set and divide by the difference of the number of days between the max date and the number of days of the min date. That numerical feature should make it possible to identify long term trends between the output of the event date: e.g. a linear slope in a regression problem to better predict evolution on forth-coming years that cannot be encoded with the boolean categorical variable for the year feature.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With