
How to drop the extra copies of duplicate index entries in a Pandas Series?

Tags: python, pandas

I have a Series s whose index contains duplicate entries:

>>> s
STK_ID  RPT_Date
600809  20061231    demo_str
        20070331    demo_str
        20070630    demo_str
        20070930    demo_str
        20071231    demo_str
        20060331    demo_str
        20060630    demo_str
        20060930    demo_str
        20061231    demo_str
        20070331    demo_str
        20070630    demo_str
Name: STK_Name, Length: 11

I want to keep the unique rows plus exactly one copy of each duplicated row. I tried:

s[s.index.unique()]

Pandas 0.10.1.dev-f7f7e13 gives the following error message:

>>> s[s.index.unique()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\Python27\lib\site-packages\pandas\core\series.py", line 515, in __getitem__
    return self._get_with(key)
  File "d:\Python27\lib\site-packages\pandas\core\series.py", line 558, in _get_with
    return self.reindex(key)
  File "d:\Python27\lib\site-packages\pandas\core\series.py", line 2361, in reindex
    level=level, limit=limit)
  File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2063, in reindex
    limit=limit)
  File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2021, in get_indexer
    raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects
>>> 

So how can I efficiently drop the extra duplicate rows of a Series, keeping the unique rows and only one copy of each duplicated row? (Preferably in one line.)

bigbug asked Jan 18 '13 09:01




2 Answers

You can group by the index and apply an aggregation that returns one value per index label. Here, I take the first value:

In [1]: from pandas import Series; s = Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])

In [2]: s
Out[2]:
1    0
2    1
2    2
2    3
5    4
6    5
7    6
7    7
7    8
8    9

In [3]: s.groupby(s.index).first()
Out[3]:
1    0
2    1
5    4
6    5
7    6
8    9
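One caveat worth knowing: GroupBy.first() returns the first non-null value in each group, which is not always the first row. A minimal sketch of the difference, using made-up data (s_nan is a hypothetical name, not from the question), with GroupBy.head(1) as one way to keep the literal first row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first entry for label 1 is missing.
s_nan = pd.Series([np.nan, 1.0, 2.0, 3.0], index=[1, 1, 2, 2])

# first() skips nulls, so label 1 maps to 1.0 rather than NaN.
first_non_null = s_nan.groupby(s_nan.index).first()

# head(1) keeps the first row of each group as-is, NaN included,
# and preserves the original row order.
first_row = s_nan.groupby(s_nan.index).head(1)

print(first_non_null)
print(first_row)
```

If the data has no missing values, the two give the same values and the choice does not matter.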

UPDATE

Addressing BigBug's comment about crashing when passing a MultiIndex to Series.groupby():

In [1]: s
Out[1]:
STK_ID  RPT_Date
600809  20061231    demo
        20070331    demo
        20070630    demo
        20070331    demo

In [2]: s.reset_index().groupby(s.index.names).first()
Out[2]:
                    0
STK_ID RPT_Date
600809 20061231  demo
       20070331  demo
       20070630  demo
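On current pandas versions, grouping on the index levels directly should also work and keeps the result as a Series instead of a DataFrame. A sketch under that assumption; the Series below is rebuilt by hand from the output above, and the names idx, s_multi and deduped are only for illustration:

```python
import pandas as pd

# Rebuild a small MultiIndex Series resembling the one shown above.
idx = pd.MultiIndex.from_tuples(
    [
        (600809, 20061231),
        (600809, 20070331),
        (600809, 20070630),
        (600809, 20070331),  # duplicate index entry
    ],
    names=["STK_ID", "RPT_Date"],
)
s_multi = pd.Series(["demo"] * 4, index=idx, name="STK_Name")

# Group on every index level and take the first value per group.
deduped = s_multi.groupby(level=list(s_multi.index.names)).first()

print(deduped)
```

Note that groupby sorts the group keys by default, so the result comes back in index order rather than in the original row order.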
Zelazny7 answered Oct 25 '22 14:10


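You can filter the Series with a boolean mask built from s.index.duplicated(). By default it marks every occurrence after the first as a duplicate, so negating the mask keeps the first value for each index label. Using @Zelazny7's example: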

import pandas as pd
s = pd.Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])

In [130]: s[~s.index.duplicated()]
Out[130]: 
1    0
2    1
5    4
6    5
7    6
8    9
dtype: int64
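The same mask should work on a MultiIndex like the one in the question, and the keep argument controls which copy survives. A small sketch with made-up values (the Series is rebuilt by hand; idx, s_multi, keep_first and keep_last are illustrative names):

```python
import pandas as pd

# Hand-built MultiIndex Series with one duplicated index entry.
idx = pd.MultiIndex.from_tuples(
    [
        (600809, 20061231),
        (600809, 20070331),
        (600809, 20070630),
        (600809, 20070331),  # duplicate of the second entry
    ],
    names=["STK_ID", "RPT_Date"],
)
s_multi = pd.Series(["a", "b", "c", "d"], index=idx, name="STK_Name")

# keep="first" (the default) drops later copies: a, b, c remain.
keep_first = s_multi[~s_multi.index.duplicated(keep="first")]

# keep="last" drops earlier copies instead: a, c, d remain.
keep_last = s_multi[~s_multi.index.duplicated(keep="last")]

print(keep_first)
print(keep_last)
```

Unlike groupby, this does not sort the result; rows stay in their original order.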
Anton Protopopov answered Oct 25 '22 16:10