I have some financial trading data for multiple products in CSV format that I would like to analyse using pandas. The trades happen at irregular intervals and are timestamped to 1-second accuracy, which means some trades appear to occur "at the same time", i.e. with identical timestamps.
The objective for the moment is to produce plots of the cumulative traded quantity for each product.
The trading data has been read into a DataFrame using read_csv(), index on parsed datetime.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 447 entries, 2012-12-07 17:16:46 to 2012-12-10 16:28:29
Data columns:
Account Name 447 non-null values
Exchange 447 non-null values
Instrument 447 non-null values
Fill ID 447 non-null values
Side 447 non-null values
Quantity 447 non-null values
Price 447 non-null values
dtypes: float64(1), int64(1), object(5)
A little work is done to add a "QuantitySigned" column.
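The post doesn't show how "QuantitySigned" is built; a minimal sketch (assuming Side contains the strings 'Buy' and 'Sell', which is a guess at the actual data) might be:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
trades = pd.DataFrame({
    'Side': ['Buy', 'Sell', 'Buy'],
    'Quantity': [10, 5, 3],
})

# Negate sell quantities so that cumsum() yields a net traded position.
trades['QuantitySigned'] = np.where(trades['Side'] == 'Buy',
                                    trades['Quantity'],
                                    -trades['Quantity'])
print(trades['QuantitySigned'].tolist())  # [10, -5, 3]
```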
I've done a "groupby" so that I can access the data by instrument.
import matplotlib.pyplot as plt

grouped = trades.groupby('Instrument', sort=True)
for name, group in grouped:
    group.QuantitySigned.cumsum().plot(label=name)
plt.legend()
The above works, but I would like to have the TimeSeries (one per instrument) in one DataFrame, i.e. a column per instrument, so that I can use DataFrame.plot(). The problem is that no two TimeSeries have exactly the same index, i.e. I would need to merge all the TimeSeries' indexes.
I know that this is supposed to work, given the trivial example below:
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn

index = pd.date_range('2012-12-21', periods=5)
s1 = Series(randn(3), index=index[:3])
s2 = Series(randn(3), index=index[2:])
df = DataFrame(index=index)
df['s1'] = s1
df['s2'] = s2
However, an exception is thrown when attempting to aggregate the TimeSeries into a DataFrame and I believe it has to do with the duplicate index elements:
grouped = trades.groupby('Instrument', sort=True)
df = DataFrame(index=trades.index)
for name, group in grouped:
    df[name] = group.QuantitySigned.cumsum()
df.plot()
Exception: Reindexing only valid with uniquely valued Index objects
Am I going about this "correctly"? Are there any suggestions on how to go about this in a better way?
Here is a runnable example that throws the exception:
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn

index = pd.DatetimeIndex(['2012-12-22', '2012-12-23', '2012-12-23'])
s1 = Series(randn(2), index=index[:2])   # No duplicate index elements
df1 = DataFrame(s1, index=index)         # This works
s2 = Series(randn(2), index=index[-2:])  # Duplicate index elements
df2 = DataFrame(s2, index=index)         # This throws
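Not part of the original question, but one way to sidestep the duplicate-timestamp problem entirely is to collapse rows that share a timestamp before assembling the columns, e.g. by summing the (signed) quantities at each instant so the index becomes unique:

```python
import pandas as pd

index = pd.DatetimeIndex(['2012-12-22', '2012-12-23', '2012-12-23'])
s = pd.Series([1.0, 2.0, 3.0], index=index)

# Sum entries that share a timestamp, leaving a unique index
# that can safely be assigned into a DataFrame column.
deduped = s.groupby(level=0).sum()
print(deduped.index.is_unique)  # True
```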
Thanks to @crewbum for the solution.
grouped = trades.groupby('Instrument', sort=True)
dflist = list()
for name, group in grouped:
    dflist.append(DataFrame({name: group.QuantitySigned.cumsum()}))
results = pd.concat(dflist)
results = results.sort_index().ffill().fillna(0)
results.plot()
Note: I forward-fill first, then set the remaining NaNs to zero. As @crewbum pointed out, ffill() and bfill() are new in 0.10.0.
pd.concat() performs an 'outer' join on the indexes by default and holes can be filled by padding forwards and/or backwards in time.
In [17]: pd.concat([DataFrame({'s1': s1}), DataFrame({'s2': s2})]).ffill().bfill()
Out[17]:
s1 s2
2012-12-21 9.0e-01 -0.3
2012-12-22 5.0e-03 -0.3
2012-12-23 -2.9e-01 -0.3
2012-12-23 -2.9e-01 -0.3
2012-12-24 -2.9e-01 -1.8
2012-12-25 -2.9e-01 -1.4
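The session above can be reproduced end-to-end; this sketch adds an explicit sort_index() for safety (with these dates the row-concatenated result happens to come out sorted anyway):

```python
import pandas as pd
from numpy.random import randn

index = pd.date_range('2012-12-21', periods=5)
s1 = pd.Series(randn(3), index=index[:3])
s2 = pd.Series(randn(3), index=index[2:])

# Row-wise concat unions the two indexes; each column is NaN where
# its series has no observation, so pad forwards then backwards.
out = (pd.concat([pd.DataFrame({'s1': s1}), pd.DataFrame({'s2': s2})])
       .sort_index()
       .ffill()
       .bfill())
print(out)
```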
I should add that ffill() and bfill() are new in pandas 0.10.0. Prior to that, you can use fillna(method='ffill') and fillna(method='bfill').