
Pandas DataFrame - desired index has duplicate values

This is my first time trying Pandas. I think I have a reasonable use case, but I am stumbling. I want to load a tab delimited file into a Pandas DataFrame, then group it by Symbol and plot it with the x-axis indexed by the TimeStamp column. Here is a subset of the data:

Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM

Note two things about the TimeStamp column:

  1. it has duplicate values and
  2. the intervals are irregular.

I thought I could do something like this...

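import pandas as pd
import matplotlib.pyplot as plt

# Use the sixth column (TimeStamp) as the index
df = pd.read_csv('data.txt', index_col=5)
# Sorting returns a new frame, so keep the result
df = df.sort_index(ascending=False)

df.plot()
plt.show()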

But the read_csv function raises an exception: "Tried columns 1-X as index but found duplicates". Is there an option that will allow me to specify an index column with duplicate values?

I would also be interested in aligning my irregular timestamp intervals to one-second resolution. I would still want to plot multiple events for a given second, but maybe I could introduce a unique index and then align my prices to it?

kavu asked Mar 04 '12 16:03



1 Answer

I created several issues just now to address some features / conveniences that I think would be nice to have: GH-856, GH-857, GH-858

We're currently working on a revamp of the time series capabilities, and alignment to one-second resolution is possible now (though not with duplicates, so you would need to write some functions for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way that you might alter things is the following:

In [29]: df.pivot(index='Symbol', columns='TimeStamp').stack()
Out[29]: 
                   M1    M2   Price   Volume
Symbol TimeStamp                            
FUEL   9:58:40 AM   6  1.05  3.8544   100116
       9:58:47 AM   7  1.07  3.8599   102116
       9:59:09 AM   8  1.11  3.9099   105265
       9:59:11 AM   9  1.15  3.9490   109674
GBR    9:57:52 AM   2  0.34  3.7500    47521
       9:58:20 AM   3  0.45  3.8000    63211
       9:58:24 AM   4  0.46  3.8300    64251
MPET   9:57:52 AM   3  0.26  1.4200    44600
ORBC   9:59:02 AM   2  0.22  3.4000    10509
SUNH   9:59:09 AM   6  0.09  4.3700    24394
TBET   9:59:05 AM   2  8.03  2.1800  1121629
       9:59:14 AM   3  8.05  2.1900  1124179
XRA    9:58:08 AM   3  0.12  3.6167    42310

Note that this created a MultiIndex. Another way I could have gotten this:

In [32]: df.set_index(['Symbol', 'TimeStamp'])
Out[32]: 
                    Price  M1    M2   Volume
Symbol TimeStamp                            
TBET   9:59:14 AM  2.1900   3  8.05  1124179
FUEL   9:59:11 AM  3.9490   9  1.15   109674
SUNH   9:59:09 AM  4.3700   6  0.09    24394
FUEL   9:59:09 AM  3.9099   8  1.11   105265
TBET   9:59:05 AM  2.1800   2  8.03  1121629
ORBC   9:59:02 AM  3.4000   2  0.22    10509
FUEL   9:58:47 AM  3.8599   7  1.07   102116
       9:58:40 AM  3.8544   6  1.05   100116
GBR    9:58:24 AM  3.8300   4  0.46    64251
       9:58:20 AM  3.8000   3  0.45    63211
XRA    9:58:08 AM  3.6167   3  0.12    42310
GBR    9:57:52 AM  3.7500   2  0.34    47521
MPET   9:57:52 AM  1.4200   3  0.26    44600

In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
Out[33]: 
                    Price  M1    M2   Volume
Symbol TimeStamp                            
FUEL   9:58:40 AM  3.8544   6  1.05   100116
       9:58:47 AM  3.8599   7  1.07   102116
       9:59:09 AM  3.9099   8  1.11   105265
       9:59:11 AM  3.9490   9  1.15   109674
GBR    9:57:52 AM  3.7500   2  0.34    47521
       9:58:20 AM  3.8000   3  0.45    63211
       9:58:24 AM  3.8300   4  0.46    64251
MPET   9:57:52 AM  1.4200   3  0.26    44600
ORBC   9:59:02 AM  3.4000   2  0.22    10509
SUNH   9:59:09 AM  4.3700   6  0.09    24394
TBET   9:59:05 AM  2.1800   2  8.03  1121629
       9:59:14 AM  2.1900   3  8.05  1124179
XRA    9:58:08 AM  3.6167   3  0.12    42310
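For anyone following along on a recent pandas release, where `sortlevel` is no longer available as far as I know, here is a minimal sketch of the same two steps using `sort_index()` instead. The inline `CSV` string is just the sample from the question, loaded through `io.StringIO` so the snippet stands alone; that loading approach is my assumption, not part of the original session.

```python
import io

import pandas as pd

# Sample rows copied from the question.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# Build the (Symbol, TimeStamp) MultiIndex, then sort it.
# sort_index() is swapped in here for the sortlevel(0) call shown above.
indexed = df.set_index(["Symbol", "TimeStamp"]).sort_index()
print(indexed)
```

The timestamps are still strings at this point, so the inner level is sorted as text. That happens to work for this sample because every time starts with the same hour digit.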

You can get this data in a true panel format like so:

In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
Out[35]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 11 (major) x 7 (minor)
Items: Price to Volume
Major axis: 9:57:52 AM to 9:59:14 AM
Minor axis: FUEL to XRA

In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()

In [37]: panel['Price']
Out[37]: 
Symbol        FUEL   GBR  MPET  ORBC  SUNH  TBET     XRA
TimeStamp                                               
9:57:52 AM     NaN  3.75  1.42   NaN   NaN   NaN     NaN
9:58:08 AM     NaN   NaN   NaN   NaN   NaN   NaN  3.6167
9:58:20 AM     NaN  3.80   NaN   NaN   NaN   NaN     NaN
9:58:24 AM     NaN  3.83   NaN   NaN   NaN   NaN     NaN
9:58:40 AM  3.8544   NaN   NaN   NaN   NaN   NaN     NaN
9:58:47 AM  3.8599   NaN   NaN   NaN   NaN   NaN     NaN
9:59:02 AM     NaN   NaN   NaN   3.4   NaN   NaN     NaN
9:59:05 AM     NaN   NaN   NaN   NaN   NaN  2.18     NaN
9:59:09 AM  3.9099   NaN   NaN   NaN  4.37   NaN     NaN
9:59:11 AM  3.9490   NaN   NaN   NaN   NaN   NaN     NaN
9:59:14 AM     NaN   NaN   NaN   NaN   NaN  2.19     NaN
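`Panel` was removed from pandas in version 0.25, so `to_panel()` will not run on current releases. If all you need is the wide table for a single item, such as the Price table above, a rough equivalent is a plain pivot with keyword arguments. This is a sketch under that assumption and not a replacement for the full 3D structure; the inline `CSV` string is the question's sample.

```python
import io

import pandas as pd

# Sample rows copied from the question.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# One row per timestamp, one column per symbol, Price as the cell value.
# Missing (TimeStamp, Symbol) combinations become NaN.
wide_price = df.pivot(index="TimeStamp", columns="Symbol", values="Price")
print(wide_price)
```

Like the panel approach, this only works while each (TimeStamp, Symbol) pair appears once; `pivot` raises an error on duplicate pairs.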

You can then generate some plots from that data.
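As one possible way to plot it, here is a small sketch that draws one line per symbol against its own timestamps, which sidesteps the NaN gaps in the wide table. It assumes matplotlib is installed, parses the time strings with `pd.to_datetime`, and selects the `Agg` backend only so the snippet runs without a display; none of that comes from the original session.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows copied from the question.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(CSV))
# Parse the 12-hour clock strings so the x-axis is a real time axis.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")

fig, ax = plt.subplots()
# One line per symbol; markers keep single-observation symbols visible.
for symbol, group in df.sort_values("TimeStamp").groupby("Symbol"):
    ax.plot(group["TimeStamp"], group["Price"], marker="o", label=symbol)

ax.set_xlabel("TimeStamp")
ax.set_ylabel("Price")
ax.legend()
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.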

Note that the timestamps are still strings here. I guess they could be converted to Python datetime.time objects, which might make things a bit easier to work with. I don't have many plans to provide a lot of support for raw times vs. timestamps (date + time), but if enough people need it I suppose I can be convinced :)
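A minimal sketch of that conversion, assuming the strings always look like `9:59:14 AM`. It shows both options: plain `datetime.time` objects and full timestamps. The column name `Time` is one I made up for the illustration, and only the first few sample rows are used.

```python
import datetime
import io

import pandas as pd

# First few sample rows copied from the question.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
parsed = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")

# Option 1: plain datetime.time objects (stored as an object column).
df["Time"] = parsed.dt.time  # "Time" is a hypothetical column name

# Option 2: full timestamps. The strings carry no date, so the parsed
# values fall on the default date 1900-01-01.
df["TimeStamp"] = parsed

print(df[["Symbol", "TimeStamp", "Time"]])
```

If the real file also has a date somewhere, combining it with the time before parsing would give more useful timestamps than the placeholder date.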

If you have multiple observations within a second for a single symbol, then some of the above methods will not work. But I want to build in better support for that in upcoming releases of pandas, so knowing your use cases will be helpful to me. Consider joining the mailing list (pystatsmodels).
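One possible workaround for that case, sketched with made-up rows: number the repeated observations with `groupby(...).cumcount()` and add that counter as an extra index level, which also gives the unique index the question asks about. The second FUEL row at 9:59:09 AM and the `obs` column name are invented for the illustration and are not in the original data.

```python
import pandas as pd

# Illustrative rows only: the second FUEL observation at 9:59:09 AM is made up
# to create a duplicate (Symbol, TimeStamp) pair.
dup = pd.DataFrame(
    {
        "Symbol": ["FUEL", "FUEL", "FUEL", "GBR"],
        "TimeStamp": ["9:59:09 AM", "9:59:09 AM", "9:59:11 AM", "9:58:24 AM"],
        "Price": [3.9099, 3.9120, 3.9490, 3.83],
    }
)

# Number repeated observations within each (Symbol, TimeStamp) pair: 0, 1, ...
dup["obs"] = dup.groupby(["Symbol", "TimeStamp"]).cumcount()

# With the counter as a third level, the index is unique again.
unique = dup.set_index(["Symbol", "TimeStamp", "obs"]).sort_index()
print(unique)

# Alternative: collapse duplicates, here by keeping the last price per pair.
last_price = dup.groupby(["Symbol", "TimeStamp"])["Price"].last()
print(last_price)
```

Whether to keep every observation or collapse them (last, mean, and so on) depends on what the plot is meant to show.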

Wes McKinney answered Oct 04 '22 02:10