Regression with Date variable using Scikit-learn

I have a Pandas DataFrame with a date column (e.g. 2013-04-01) holding datetime.date values. When I include that column in X_train and try to fit the regression model, I get the error "float() argument must be a string or a number". Removing the date column avoids the error.

What is the proper way to take the date into account in the regression model?

Code

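from sklearn.ensemble import RandomForestRegressor

data = sql.read_frame(...)
# Features: every column except the target 'y' (this includes the date column)
X_train = data.drop('y', axis=1)
y_train = data.y

rf = RandomForestRegressor().fit(X_train, y_train)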

Error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295
    296         n_samples, self.n_features_ = X.shape

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    318
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number

432

asked May 09 '13 03:05

Nyxynyx

1 Answer

The best way is to explode the date into a set of categorical features, each encoded in boolean form with 1-of-K (one-hot) encoding, e.g. as done by DictVectorizer. Here are some features that can be extracted from a date:

hour of the day (24 boolean features)
day of the week (7 boolean features)
day of the month (up to 31 boolean features)
month of the year (12 boolean features)
year (as many boolean features as there are distinct years in your dataset)
...

That should make it possible to identify linear dependencies on periodic events tied to typical human life cycles.
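As a minimal sketch of the 1-of-K encoding described above, each date can be turned into a dict of string-valued features and passed to DictVectorizer, which creates one boolean column per distinct (feature, value) pair. The helper name date_to_categories and the sample dates are made up for illustration. Hour of the day is left out because datetime.date values carry no time component.

```python
import datetime

from sklearn.feature_extraction import DictVectorizer


def date_to_categories(d):
    """Map a date to string-valued categorical features (hypothetical helper)."""
    return {
        "day_of_week": str(d.weekday()),  # 0 = Monday ... 6 = Sunday
        "day_of_month": str(d.day),
        "month": str(d.month),
        "year": str(d.year),
    }


# Illustrative sample dates, not taken from the question's data.
dates = [
    datetime.date(2013, 4, 1),
    datetime.date(2013, 4, 2),
    datetime.date(2014, 12, 25),
]

# String values are one-hot encoded: one column per "feature=value" pair.
vectorizer = DictVectorizer(sparse=False)
X_dates = vectorizer.fit_transform([date_to_categories(d) for d in dates])
```

The resulting array can be stacked next to the other numeric columns of X_train in place of the raw date column. Note that only values seen during fitting get a column, so a year absent from the training set is encoded as all zeros at prediction time.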

Additionally, you can extract the date as a single float: convert each date to the number of days since the minimum date of your training set, then divide by the number of days between the minimum and maximum dates of the training set. That numerical feature should make it possible to identify long-term trends between the output and the event date, e.g. a linear slope in a regression problem that helps predict the evolution over forthcoming years, which cannot be encoded with the boolean categorical variables for the year feature.
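The scaling described above could be sketched as follows, assuming the dates are datetime.date objects. The function name fit_date_scaler and the sample dates are illustrative, not part of any library.

```python
import datetime


def fit_date_scaler(train_dates):
    """Return a function mapping a date to (days since min) / (max - min).

    The min and max are taken from the training dates only (hypothetical helper).
    """
    d_min, d_max = min(train_dates), max(train_dates)
    span_days = (d_max - d_min).days
    if span_days == 0:
        raise ValueError("training dates must span at least one day")

    def scale(d):
        # float() keeps true division under Python 2.7 as well.
        return (d - d_min).days / float(span_days)

    return scale


# Illustrative training dates, not taken from the question's data.
train_dates = [
    datetime.date(2013, 1, 1),
    datetime.date(2013, 7, 1),
    datetime.date(2013, 12, 31),
]
scale = fit_date_scaler(train_dates)

trend = [scale(d) for d in train_dates]      # values in [0, 1] on the training set
future = scale(datetime.date(2014, 12, 31))  # later dates map to values above 1
```

Training dates land in the range 0 to 1, while later dates map above 1, which is what lets a linear model extend a trend. Tree-based models such as RandomForestRegressor do not extrapolate beyond the range seen in training, so this feature mainly benefits linear regressors.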

133

answered Oct 08 '22 12:10

ogrisel


Tags: python, pandas, numpy, python-2.7, scikit-learn
