Adding heterogeneous TimeSeries to a DataFrame

The Objective

I have some financial trading data for multiple products in CSV format that I would like to analyse using pandas. The trades happen at non-regular intervals and are timestamped to 1sec accuracy, which results in some trades occurring "at the same time", i.e. with identical timestamps.

The objective for the moment is to produce plots of the cumulative traded quantity for each product.

Current Progress

The trading data has been read into a DataFrame using read_csv(), indexed on the parsed datetime.

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 447 entries, 2012-12-07 17:16:46 to 2012-12-10 16:28:29
Data columns:
Account Name    447  non-null values
Exchange        447  non-null values
Instrument      447  non-null values
Fill ID         447  non-null values
Side            447  non-null values
Quantity        447  non-null values
Price           447  non-null values
dtypes: float64(1), int64(1), object(5)
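For reference, loading such a file could look something like the sketch below. The column names follow the summary above, but the `Time` column name and the sample rows are assumptions, not the actual data:

```python
import io
import pandas as pd

# Hypothetical CSV matching the columns listed above; the "Time" column
# name and the row values are illustrative assumptions.
csv = io.StringIO(
    "Time,Account Name,Exchange,Instrument,Fill ID,Side,Quantity,Price\n"
    "2012-12-07 17:16:46,acct1,XEUR,FDAX,1,Buy,2,7550.0\n"
    "2012-12-07 17:16:46,acct1,XEUR,FESX,2,Sell,5,2650.0\n"
)

# index_col + parse_dates gives a DatetimeIndex, which can contain
# duplicate timestamps when trades share the same 1-second stamp.
trades = pd.read_csv(csv, index_col='Time', parse_dates=True)
```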

A little work is done to add a "QuantitySigned" column.
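A minimal sketch of deriving that column, assuming Side takes the values 'Buy'/'Sell' (the labels and sample values are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical trades frame; the 'Buy'/'Sell' labels are assumptions.
trades = pd.DataFrame({'Side': ['Buy', 'Sell', 'Buy'],
                       'Quantity': [2, 5, 1]})

# Buys count as positive quantity, sells as negative.
trades['QuantitySigned'] = np.where(trades['Side'] == 'Buy',
                                    trades['Quantity'],
                                    -trades['Quantity'])
```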

I've done a "groupby" so that I can access the data by instrument.

import matplotlib.pyplot as plt

grouped = trades.groupby('Instrument', sort=True)
for name, group in grouped:
    group.QuantitySigned.cumsum().plot(label=name)
plt.legend()

The Question

The above works, but I would like to have the TimeSeries (one per instrument) in one DataFrame, i.e. a column per instrument, so that I can use DataFrame.plot(). The problem is that no two TimeSeries have exactly the same index, i.e. I would need to merge all the TimeSeries' indexes.

I know that this is supposed to work, given the trivial example below:

import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn

index = pd.date_range('2012-12-21', periods=5)
s1 = Series(randn(3), index=index[:3])
s2 = Series(randn(3), index=index[2:])
df = DataFrame(index=index)
df['s1'] = s1
df['s2'] = s2

However, an exception is thrown when attempting to aggregate the TimeSeries into a DataFrame and I believe it has to do with the duplicate index elements:

grouped = trades.groupby('Instrument', sort=True)
df = DataFrame(index=trades.index)
for name, group in grouped:
    df[name] = group.QuantitySigned.cumsum()
df.plot()

Exception: Reindexing only valid with uniquely valued Index objects

Am I going about this "correctly"? Are there any suggestions on how to go about this in a better way?

Runnable Example

Here is a runnable example that throws the exception:

import pandas as pd
from pandas import Series
from pandas import DataFrame
from numpy.random import randn

index = pd.tseries.index.DatetimeIndex(['2012-12-22', '2012-12-23', '2012-12-23'])

s1 = Series(randn(2), index[:2]) # No duplicate index elements
df1 = DataFrame(s1, index=index) # This works

s2 = Series(randn(2), index[-2:]) # Duplicate index elements
df2 = DataFrame(s2, index=index) # This throws

The Solution

Thanks to @crewbum for the solution.

grouped = trades.groupby('Instrument', sort=True)
dflist = list()
for name, group in grouped:
    dflist.append(DataFrame({name : group.QuantitySigned.cumsum()}))
results = pd.concat(dflist)
results = results.sort().ffill().fillna(0)
results.plot()

Note: I forward-fill first, then set the remaining NaNs to zero. As @crewbum pointed out, ffill() and bfill() are new in 0.10.0.
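For anyone reading along, here is a self-contained run of the same pattern on synthetic trades (instrument names, timestamps, and quantities are illustrative; sort_index() is the modern spelling of the sort() call above):

```python
import pandas as pd

# Synthetic trades with a duplicate timestamp across instruments.
idx = pd.to_datetime(['2012-12-07 17:16:46', '2012-12-07 17:16:46',
                      '2012-12-07 17:16:47', '2012-12-07 17:16:48'])
trades = pd.DataFrame({'Instrument': ['FDAX', 'FESX', 'FDAX', 'FESX'],
                       'QuantitySigned': [2, -5, 1, 3]}, index=idx)

# One single-column frame per instrument, holding its running total.
grouped = trades.groupby('Instrument', sort=True)
dflist = [pd.DataFrame({name: group['QuantitySigned'].cumsum()})
          for name, group in grouped]

# Outer-join concat, then sort by time, pad forwards, zero the leading NaNs.
results = pd.concat(dflist).sort_index().ffill().fillna(0)
```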

I'm using:

  • pandas 0.10.0
  • numpy 1.6.1
  • Python 2.7.3.
Asked Dec 21 '12 by commander.trout

1 Answer

pd.concat() performs an 'outer' join on the indexes by default and holes can be filled by padding forwards and/or backwards in time.
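A concrete sketch of that join, with fixed illustrative values in place of the question's randn data:

```python
import pandas as pd

# Two series on overlapping but unequal date indexes.
index = pd.date_range('2012-12-21', periods=5)
s1 = pd.Series([1.0, 2.0, 3.0], index=index[:3])
s2 = pd.Series([4.0, 5.0, 6.0], index=index[2:])

# Concatenating single-column frames takes the union of both indexes;
# missing cells become NaN, which ffill/fillna then pad over.
df = pd.concat([pd.DataFrame({'s1': s1}), pd.DataFrame({'s2': s2})])
df = df.sort_index().ffill().fillna(0)
```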

In [17]: pd.concat([DataFrame({'s1': s1}), DataFrame({'s2': s2})]).ffill().bfill()
Out[17]: 
                 s1   s2
2012-12-21  9.0e-01 -0.3
2012-12-22  5.0e-03 -0.3
2012-12-23 -2.9e-01 -0.3
2012-12-23 -2.9e-01 -0.3
2012-12-24 -2.9e-01 -1.8
2012-12-25 -2.9e-01 -1.4

I should add that ffill() and bfill() are new in pandas 0.10.0. Prior to that, you can use fillna(method='ffill') and fillna(method='bfill').

Answered Sep 26 '22 by Garrett