New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but different indexes, and they are indexed by date. The Series is indexed by ticker symbol. So each item in the Sequence represents a single time series of each individual stock's performance.
I need to randomly generate a list of n data frames, where each dataframe is a subset of some random assortment of the available stocks' histories. It's ok if there is overlap, so long as start end end dates are different.
This following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if type(data) != pd.Series:
return None
if subset=='validate':
offset = 0
elif subset=='test':
offset = 200
elif subset=='train':
offset = 400
tickers = np.random.randint(0, len(data), size=len(data))
ret_data = []
while len(ret_data) != batch_size:
for t in tickers:
data_t = data[t]
max_len = len(data_t)-timesteps-1
if len(ret_data)==batch_size: break
if max_len-offset < 0: continue
index = np.random.randint(offset, max_len)
d = data_t[index:index+timesteps]
if len(d)==timesteps: ret_data.append(d)
return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 @profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data
Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify, but won't necessarily be any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks
as a list of indices to be applied to your Series of dataframes. The size is already batch size so that eliminates the need for the while loop and your comparison on line 156.
That is, you can iterate over rand_stocks
and access the stock like so:
for idx in rand_stocks:
stock = data.ix[idx]
# Get a sample from this stock.
Step 2: Get a random datarange for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = data_t[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
if subset=='train': offset = 0 #you can obviously change this back
rand_stocks = np.random.randint(0, len(data), size=batch_size)
ret_data = []
for idx in rand_stocks:
stock = data[idx]
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
ret_data.append(d)
return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With