Scikit-learn : Input contains NaN, infinity or a value too large for dtype ('float64')

I'm using Python's scikit-learn for simple linear regression on data read from a CSV file.

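import numpy as np
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split

reader = pandas.read_csv("data/all-stocks-cleaned.csv")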
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

print(np.min(openingPrice))
print(np.min(closingPrice))
print(np.max(openingPrice))
print(np.max(closingPrice))

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)


openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))

openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
# openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)

closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

The min and max values are printed as 0.0, 0.6, 41998.0 and 2593.9.

Yet I'm getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

How can I get rid of this error? Judging by the output above, the data doesn't seem to contain infinities or NaN values.

What's the solution for this?

Edit: all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv

248

asked Jan 14 '16 01:01

Vishwajeet Vatharkar

1 Answer

The problem with your regression is that NaNs have somehow sneaked into your data. You can check this easily with the following snippet:

import pandas as pd
import numpy as np
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

reader = pd.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)

openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)

closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)

np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()

(True, True, True)
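Since the error message also mentions infinity, it can help to count both kinds of bad values at once. The following is a minimal sketch on a made-up sample (the `count_non_finite` helper and the sample values are illustrative assumptions, not part of the original data):

```python
import numpy as np

def count_non_finite(values):
    """Return (nan_count, inf_count) for an array-like of numbers."""
    # Cast first: np.isnan does not accept object-dtype arrays
    arr = np.asarray(values, dtype=np.float64)
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())

# Hypothetical sample standing in for one price column
sample = np.array([10.5, np.nan, 12.0, np.inf, np.nan], dtype=object)
nan_count, inf_count = count_non_finite(sample)
print(nan_count, inf_count)
```

Running the same helper on each train/test array should show whether the values sklearn rejects are NaNs, infinities, or both.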

If you impute the missing values, for example with the median as below:

openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])

your regression will run without a problem:

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

predicted[:5]

array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])
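One variation worth considering is to compute the fill value from the training split only and reuse it for the test split, so that no information from the test data enters the preprocessing. This is a sketch of that idea under my own assumptions (the helper name and the toy arrays are hypothetical), not the method used above:

```python
import numpy as np

def impute_with_train_median(train, test):
    """Fill NaNs in both arrays with the median of the non-missing training values."""
    train = np.asarray(train, dtype=np.float64).copy()
    test = np.asarray(test, dtype=np.float64).copy()
    fill_value = np.nanmedian(train)  # median that ignores NaNs
    train[np.isnan(train)] = fill_value
    test[np.isnan(test)] = fill_value
    return train, test, fill_value

# Toy column vectors shaped like the reshaped price arrays
train = np.array([[1.0], [np.nan], [3.0], [5.0]])
test = np.array([[np.nan], [2.0]])
train_filled, test_filled, fill_value = impute_with_train_median(train, test)
print(fill_value)
```

The inputs are copied, so the original arrays keep their NaNs and the step can be rerun with a different strategy.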

In short: you have missing values in your data, as the error message said.
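As for why the printed min and max looked clean: on a float64 array, `np.min` and `np.max` propagate NaN, so a NaN-free printout is only conclusive after the cast to float. My guess (an assumption, not something confirmed by the question) is that `np.array(reader)` produced an object-dtype array because of the text Date column, and min/max on such an array need not surface the NaNs. A small sketch of the float64 behaviour, with made-up values:

```python
import numpy as np

# Made-up prices with one missing value
prices = np.array([0.6, np.nan, 2593.9], dtype=np.float64)

print(np.min(prices))     # NaN propagates through min on a float array
print(np.nanmin(prices))  # NaN-aware variant ignores the missing entry
print(np.nanmax(prices))  # same for the maximum
```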

EDIT:

Perhaps an easier and more straightforward approach is to check for missing data right after you read the file with pandas:

data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True

and then impute the data with either of the two lines below:

# fill each numeric column with its own median
data = data.fillna(data.median(numeric_only=True))

# or carry the last valid observation forward
data = data.ffill()
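To see how these options behave, here is a small self-contained sketch on a hypothetical frame (the column values are invented for illustration; only the column names echo the CSV). It also shows `dropna()` as a third option, which discards incomplete rows instead of filling them:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking three of the CSV's columns
data = pd.DataFrame({
    "Date": ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07"],
    "Open": [100.0, np.nan, 104.0, 102.0],
    "Close": [101.0, 103.0, np.nan, 105.0],
})

missing_per_column = data.isnull().sum()                      # NaN count per column
median_filled = data.fillna(data.median(numeric_only=True))   # per-column median
forward_filled = data.ffill()                                 # previous row's value
dropped = data.dropna()                                       # rows without any NaN

print(missing_per_column)
print(median_filled)
```

Which option is appropriate depends on the data: forward filling suits ordered time series, while dropping rows is the simplest choice when only a few are affected.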

187

answered Nov 15 '22 17:11

Sergey Bushmanov

Tags: python, machine-learning, numpy, scikit-learn
