I'm using scikit-learn in Python to run a simple linear regression on data read from a CSV file.
reader = pandas.io.parsers.read_csv("data/all-stocks-cleaned.csv")
stock = np.array(reader)
openingPrice = stock[:, 1]
closingPrice = stock[:, 5]
print((np.min(openingPrice)))
print((np.min(closingPrice)))
print((np.max(openingPrice)))
print((np.max(closingPrice)))
openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)
openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
# openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)
closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)
openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))
regression = linear_model.LinearRegression()
regression.fit(openingPriceTrain, closingPriceTrain)
predicted = regression.predict(openingPriceTest)
The min and max values are shown as 0.0 0.6 41998.0 2593.9
Yet I'm getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
How can I get rid of this error? Judging from the output above, the data doesn't seem to contain infinities or NaN values.
What's the solution for this?
Edit: all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv
Having NaN and infinity in the input data: to remove NaN and infinity from the input data, you need a boolean mask that is True at the positions containing NaNs, and for that, you can use np.isnan(X). Note that you also need to get back a tuple with the i, j coordinates of the NaNs, and for that, you can use np.
ValueError: Input contains infinity or a value too large for dtype('float64'). This error usually occurs when you attempt to use some function from the scikit-learn module, but the DataFrame or matrix you're using as input has NaN values or infinite values.
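The mask-and-coordinates check described above can be sketched as follows. This is a minimal illustration on a small made-up array, not the stock data: the name X and its values are assumptions, and np.argwhere is just one way to obtain the coordinates, chosen here for illustration. np.isfinite is used so that both NaN and infinite entries are flagged in one pass.

```python
import numpy as np

# Hypothetical 3x2 input with one NaN and one infinity (not the stock data).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.inf]])

# Boolean mask: True where the value is NaN or +/- infinity.
bad_mask = ~np.isfinite(X)

# (row, column) coordinates of every offending entry.
bad_positions = np.argwhere(bad_mask)

print("any non-finite values:", bad_mask.any())
print("positions of non-finite values:")
print(bad_positions)
```

If bad_mask.any() is True, scikit-learn estimators will typically reject the array with the ValueError quoted above, so this check is worth running before calling fit or predict.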
The problem with your regression is that NaNs have somehow sneaked into your data. This can easily be checked with the following code snippet:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
reader = pd.io.parsers.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)
openingPrice = stock[:, 1]
closingPrice = stock[:, 5]
openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)
openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)
openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)
np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()
(True, True, True)
If you impute the missing values with the median of the observed values, as below:
openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])
your regression will run without a problem:
regression = linear_model.LinearRegression()
regression.fit(openingPriceTrain, closingPriceTrain)
predicted = regression.predict(openingPriceTest)
predicted[:5]
array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])
In short: you have missing values in your data, as the error message said.
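The median imputation can also be wrapped in a small helper. The sketch below is one possible way to do it, not the code from the answer: the function name impute_with_median and the toy arrays are hypothetical, and it deliberately reuses the training median for the test set, which differs from the lines above where each array is filled with its own median.

```python
import numpy as np

def impute_with_median(values, fill_value=None):
    """Return a float64 copy of `values` with NaNs replaced, plus the value used.

    If `fill_value` is None, the median of the non-NaN entries is used.
    """
    arr = np.array(values, dtype=np.float64)  # np.array copies, so the input is untouched
    if fill_value is None:
        fill_value = np.nanmedian(arr)        # median that ignores NaNs
    arr[np.isnan(arr)] = fill_value
    return arr, fill_value

# Hypothetical toy data standing in for the price columns.
train = np.array([10.0, np.nan, 30.0, 20.0])
test = np.array([np.nan, 25.0])

train_filled, train_median = impute_with_median(train)
# Reuse the training median so that no statistic is computed from the test set.
test_filled, _ = impute_with_median(test, fill_value=train_median)

print(train_filled)
print(test_filled)
```

Reusing the training statistic is a design choice rather than a requirement; it keeps the test set from influencing how the model's inputs are prepared.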
EDIT:
Perhaps an easier and more straightforward approach would be to check for missing data right after you read the file with pandas:
data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True
and then impute the data with either of the two lines below:
data = data.fillna(data.median(numeric_only=True))  # fill each numeric column with its own median
or
data = data.ffill()  # carry the last observed value forward
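To see what the two strategies do, here is a minimal sketch on a small made-up frame. The column names mirror the CSV, but the values are invented for illustration. The median fill is written with data.median(numeric_only=True) so that the non-numeric Date column is skipped, and the forward fill uses DataFrame.ffill().

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the stock table (values are made up).
data = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"],
    "Open": [10.0, np.nan, 12.0, 14.0],
    "Close": [11.0, 12.5, np.nan, 13.5],
})

# Which columns contain at least one missing value?
missing_per_column = data.isnull().any()
print(missing_per_column)

# Strategy 1: fill each numeric column with its own median.
median_filled = data.fillna(data.median(numeric_only=True))

# Strategy 2: carry the last observed value forward.
forward_filled = data.ffill()

print(median_filled)
print(forward_filled)
```

Note that a forward fill cannot fill a gap in the very first row, since there is no earlier value to carry forward, so it is worth re-running isnull().any() afterwards.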