
Python Pandas -- Random sampling of time series

Tags:

python

pandas

New to Pandas, looking for the most efficient way to do this.

I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series is indexed by ticker symbol, so each item in the Series represents a single time series of an individual stock's performance.

I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of some random assortment of the available stocks' histories. Overlap is fine, as long as the start and end dates differ.
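For reference, a Series of DataFrames like the one described can be built as follows (the ticker names and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical tickers and a shared column layout
tickers = ["AAA", "BBB", "CCC"]
dates = pd.date_range("2012-01-01", periods=500, freq="B")

# One DataFrame of daily performance per ticker, indexed by date
frames = {
    t: pd.DataFrame(
        np.random.randn(len(dates), 2), index=dates, columns=["open", "close"]
    )
    for t in tickers
}
data = pd.Series(frames)  # Series of DataFrames, indexed by ticker symbol
```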

The following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:

Code

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data

Profile output:

Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset < 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data
asked Nov 12 '22 by Dave S

1 Answer

Are you sure you need a faster method? Your current one isn't that slow. The following changes might simplify it, but won't necessarily make it faster:

Step 1: Take a random sample (with replacement) from the list of dataframes

rand_stocks = np.random.randint(0, len(data), size=batch_size) 

You can treat this array rand_stocks as a list of indices into your Series of DataFrames. Its size is already batch_size, which eliminates the need for the while loop and the comparison on line 156 of the profile.
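As a small illustration of this sampling-with-replacement step (the counts here are made up, and the seed is set only so the example is repeatable):

```python
import numpy as np

np.random.seed(0)  # seeded purely for a repeatable illustration
batch_size, n_stocks = 4, 6

# Sample stock indices with replacement; duplicates are allowed
rand_stocks = np.random.randint(0, n_stocks, size=batch_size)
print(rand_stocks)
```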

That is, you can iterate over rand_stocks and access the stock like so:

for idx in rand_stocks: 
  stock = data.iloc[idx]  # .ix in older pandas; it has since been removed
  # Get a sample from this stock. 

Step 2: Get a random datarange for each stock you have randomly selected.

start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
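As a standalone illustration of this windowing step (the series here is synthetic, and the seed is set only for repeatability):

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # seeded purely for a repeatable illustration
stock = pd.Series(np.random.randn(300))  # stand-in for one stock's history
offset, timesteps = 0, 100

# Any start in [offset, len(stock) - timesteps) leaves room for a full window
start_idx = np.random.randint(offset, len(stock) - timesteps)
d = stock[start_idx:start_idx + timesteps]  # contiguous window of `timesteps` rows
```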

I don't have your data, but here's how I put it together:

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0  #you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data

Creating a dataset:

In [22]: import numpy as np
In [23]: import pandas as pd

In [24]: rndrange = pd.date_range('1/1/2012', periods=72, freq='B')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02    2.025795
2012-01-03    1.731667
2012-01-04    0.092725
2012-01-05   -0.489804
2012-01-06   -0.090041

In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]

Testing the function:

In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23    1.464576
2012-01-24   -1.052048,
 2012-01-23    1.464576
2012-01-24   -1.052048]
answered Nov 15 '22 by Aman