
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by MinMaxScaler

This semester I started working with ML. We have only used APIs such as Microsoft's Azure and Amazon's AWS, but we have not gone in depth into how those services work. My good friend, a senior math major, asked me to help him create a stock predictor with TensorFlow, based on a .csv file he provided.

There are a few problems. The first one is his .csv file: it contains only dates and closing values, and they are not separated, so I had to separate them manually. I've managed to do that, and now I'm having trouble with MinMaxScaler(). I was told I could pretty much disregard the dates and work only with the closing values: normalize them and make a prediction based on them.

I keep getting this error:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by MinMaxScaler()
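The error itself is easy to reproduce: MinMaxScaler refuses to fit an array with zero samples. A minimal repro (not part of the original question):

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    scaler = MinMaxScaler()
    scaler.fit(np.empty((0, 1)))  # raises the same ValueError: 0 sample(s), minimum of 1 required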

I honestly have never used scikit-learn or TensorFlow before, and this is my first time working on such a project. All the guides I see on the topic use pandas, but in my case the .csv file is a mess, and I don't believe I can use pandas for it.
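For what it's worth, pandas can often cope with files like this. A minimal sketch, assuming each line is a single quoted "date<TAB>close" string (which is what the parsing loop below suggests) and that the values contain no commas; the column names are made up:

    import pandas as pd

    # Read each line as one raw field, then split on the embedded tab.
    raw = pd.read_csv("s&p500closing.csv", header=0, names=["raw"])
    parts = raw["raw"].str.strip('"').str.split("\t", expand=True)
    df = pd.DataFrame({
        "date": pd.to_datetime(parts[0]),
        "close": parts[1].astype("float32"),
    })
    print(df.head())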

I'm following this guide:

But unfortunately, due to my lack of experience, some things are not really working for me, and I would appreciate a little more clarity on how I should proceed in my case.

Attached below is my (messy) code:

import pandas as pd
import numpy as np
import tensorflow as tf
import sklearn
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale
from sklearn.preprocessing import MinMaxScaler
import matplotlib
import matplotlib.pyplot as plt
from dateutil.parser import parse
from datetime import datetime, timedelta
from collections import deque

stock_data = []
stock_date = []
stock_value = []
with open("s&p500closing.csv", "r") as f:
    data = f.read()
rows = data.split("\n")
rows_noheader = rows[1:]

#Separating values from the messy `.csv`, putting each into its own list, plus a combined list of both
for row in rows_noheader:
    if not row:  # skip the blank line produced by a trailing newline
        continue
    date, value = row[1:-1].split('\t')
    stock_date.append(date)
    stock_value.append(value)
    stock_data.append((date, value))

#Numpy array of all closing values converted to floats and normalized against the maximum
stock_value = np.array(stock_value, dtype=np.float32)
normvalue = stock_value / stock_value.max()

#Number of closing values and days. There is one closing value per day, so the counts match: 4528 of each
nclose_and_days = len(stock_data)

train_data = stock_value[:2264]
test_data = stock_value[2264:]

scaler = MinMaxScaler()

train_data = train_data.reshape(-1,1)
test_data = test_data.reshape(-1,1)

# Train the Scaler with training data and smooth data
smoothing_window_size = 1100
for di in range(0,4400,smoothing_window_size):
    #error occurs here
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

# You normalize the last bit of remaining data
scaler.fit(train_data[di+smoothing_window_size:,:])
train_data[di+smoothing_window_size:,:] = scaler.transform(train_data[di+smoothing_window_size:,:])

# Reshape both train and test data
train_data = train_data.reshape(-1)

# Normalize test data
test_data = scaler.transform(test_data).reshape(-1)

# Now perform exponential moving average smoothing
# So the data will have a smoother curve than the original ragged data
EMA = 0.0
gamma = 0.1
for ti in range(1100):
    EMA = gamma*train_data[ti] + (1-gamma)*EMA
    train_data[ti] = EMA

# Used for visualization and test purposes
all_mid_data = np.concatenate([train_data,test_data],axis=0)

window_size = 100
N = train_data.size
std_avg_predictions = []
std_avg_x = []
mse_errors = []

for pred_idx in range(window_size,N):
    std_avg_predictions.append(np.mean(train_data[pred_idx-window_size:pred_idx]))
    mse_errors.append((std_avg_predictions[-1]-train_data[pred_idx])**2)
    std_avg_x.append(date)  # note: 'date' here is the stale variable left over from the parsing loop

print('MSE error for standard averaging: %.5f'%(0.5*np.mean(mse_errors)))
asked Nov 21 '18 by Daniel Vaindiner


3 Answers

I know that this post is old, but as I stumbled here, others will too. After running into the same problem and googling quite a bit, I found this post: https://github.com/llSourcell/Make_Money_with_Tensorflow_2.0/issues/7

It seems that if you download too small a dataset, it will throw that error. Download a .csv going back to 1962 and it'll be big enough ;)

Now I just have to find the right parameters for my dataset, as I'm adapting this to another type of prediction. Hope it helps.
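A cheap way to fail fast on this (a sketch, not from the original answer; the window count of 4 mirrors the question's range(0, 4400, 1100) loop):

    def check_enough_data(train_data, smoothing_window_size, n_windows=4):
        # Fail early with a readable message if the training set is too
        # short for the windowed MinMaxScaler fitting loop.
        needed = smoothing_window_size * n_windows
        if len(train_data) < needed:
            raise ValueError(
                f"need at least {needed} training samples for {n_windows} "
                f"windows of {smoothing_window_size}, got {len(train_data)}; "
                "download a longer price history"
            )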

answered Nov 02 '22 by Vincenzo


The train_data variable has a length of 2264:

train_data = stock_value[:2264]

Then, when you go to fit the scaler, the for loop steps outside train_data's bounds:

smoothing_window_size = 1100
for di in range(0, 4400, smoothing_window_size):

On the third iteration (di = 2200) the slice already runs past the end of the array, and on the fourth (di = 3300) the slice train_data[3300:4400] is empty, which is exactly what raises the ValueError. Notice the size of the data set in the tutorial: the training and testing chunks each have length 11,000 and smoothing_window_size is 2500, so the loop there never exceeds train_data's boundaries.
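One way to make the loop respect the actual array length (a sketch adapted from the question's code, not part of the original answer) is to iterate up to len(train_data), which also makes the separate "normalize the last bit" step after the loop unnecessary:

    smoothing_window_size = 1100
    for di in range(0, len(train_data), smoothing_window_size):
        chunk = train_data[di:di + smoothing_window_size, :]  # never empty, since di < len(train_data)
        scaler.fit(chunk)
        train_data[di:di + smoothing_window_size, :] = scaler.transform(chunk)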

answered Nov 02 '22 by joaosantos


You have a column of all 0's in your data. If you try to scale it, the MinMaxScaler can't assign a scale and it trips up. You need to filter out empty/NaN columns before you scale the data. Try:

    stock_value = stock_value[:, ~np.all(np.isnan(stock_value), axis=0)]

to filter out NaN columns in your data.
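To see what that filter does, here is a hypothetical two-column example (note that the question's stock_value is 1-D, so it would need reshaping to 2-D before a column filter like this applies):

    import numpy as np

    d = np.array([[1.0, np.nan],
                  [2.0, np.nan],
                  [3.0, np.nan]])
    # keep only the columns that are not entirely NaN
    filtered = d[:, ~np.all(np.isnan(d), axis=0)]
    print(filtered)  # [[1.] [2.] [3.]]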

answered Nov 02 '22 by BeardySam