I have a Series s with a duplicate index:
>>> s
STK_ID RPT_Date
600809 20061231 demo_str
20070331 demo_str
20070630 demo_str
20070930 demo_str
20071231 demo_str
20060331 demo_str
20060630 demo_str
20060930 demo_str
20061231 demo_str
20070331 demo_str
20070630 demo_str
Name: STK_Name, Length: 11
I just want to keep the unique rows and only one copy of each duplicated row, using:
s[s.index.unique()]
With pandas 0.10.1.dev-f7f7e13, this gives the error below:
>>> s[s.index.unique()]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 515, in __getitem__
return self._get_with(key)
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 558, in _get_with
return self.reindex(key)
File "d:\Python27\lib\site-packages\pandas\core\series.py", line 2361, in reindex
level=level, limit=limit)
File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2063, in reindex
limit=limit)
File "d:\Python27\lib\site-packages\pandas\core\index.py", line 2021, in get_indexer
raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects
>>>
So how can I drop the extra duplicate rows of a Series, keeping the unique rows and only one copy of each duplicated row, in an efficient way (preferably in one line)?
Index.duplicated() indicates duplicate index values: duplicates are marked as True in the resulting boolean array. The keep argument controls which occurrences are marked: all except the first ('first', the default), all except the last ('last'), or every occurrence of a duplicated value (False).
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T; this removes all columns that hold the same data, regardless of column names.
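As a small illustration of the transpose approach for duplicate columns, here is a sketch on a toy DataFrame. The frame and its column names (a, b, c) are made up for the example and are not from the question.

```python
import pandas as pd

# Hypothetical toy frame: columns "a" and "b" hold identical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped = df.T.drop_duplicates().T

print(list(deduped.columns))
```

Note that transposing a frame with mixed column dtypes can change the dtypes, so this is probably best suited to small, homogeneous frames.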
You can groupby the index and apply a function that returns one value per index group. Here, I take the first value:
In [1]: s = Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])
In [2]: s
Out[2]:
1 0
2 1
2 2
2 3
5 4
6 5
7 6
7 7
7 8
8 9
In [3]: s.groupby(s.index).first()
Out[3]:
1 0
2 1
5 4
6 5
7 6
8 9
UPDATE
Addressing BigBug's comment about crashing when passing a MultiIndex to Series.groupby():
In [1]: s
Out[1]:
STK_ID RPT_Date
600809 20061231 demo
20070331 demo
20070630 demo
20070331 demo
In [2]: s.reset_index().groupby(s.index.names).first()
Out[2]:
0
STK_ID RPT_Date
600809 20061231 demo
20070331 demo
20070630 demo
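The MultiIndex workaround above can be reproduced roughly as follows. This is a sketch, assuming a small Series built by hand with the same index names as in the question; the Series name STK_Name is taken from the question's output, and the final column selection is only there to get a Series back instead of a DataFrame.

```python
import pandas as pd

# Rebuild a small Series with a duplicated (STK_ID, RPT_Date) pair.
index = pd.MultiIndex.from_tuples(
    [
        (600809, 20061231),
        (600809, 20070331),
        (600809, 20070630),
        (600809, 20070331),  # duplicate of the second entry
    ],
    names=["STK_ID", "RPT_Date"],
)
s = pd.Series(["demo"] * 4, index=index, name="STK_Name")

# Move the index levels into columns, group on them, keep the first row
# of each group, then select the value column to return to a Series.
level_names = list(s.index.names)
deduped = s.reset_index().groupby(level_names).first()["STK_Name"]

print(deduped)
```

By default groupby sorts the group keys, so the result may not preserve the original row order.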
You could subset your data using duplicated on the index (it keeps the first occurrence by default). With @Zelazny7's example:
s = pd.Series(range(10), index=[1,2,2,2,5,6,7,7,7,8])
In [130]: s[~s.index.duplicated()]
Out[130]:
1 0
2 1
5 4
6 5
7 6
8 9
dtype: int64
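A short sketch of the same idea with the keep argument spelled out, plus the MultiIndex case from the question. It assumes a pandas version recent enough to provide Index.duplicated (the 0.10.1 dev build in the question predates it); the variable names are just for illustration.

```python
import pandas as pd

s = pd.Series(range(10), index=[1, 2, 2, 2, 5, 6, 7, 7, 7, 8])

# Keep the first occurrence of each index value (the default).
first = s[~s.index.duplicated(keep="first")]

# Keep the last occurrence of each index value instead.
last = s[~s.index.duplicated(keep="last")]

# The same mask works on a MultiIndex like the one in the question.
mi = pd.MultiIndex.from_tuples(
    [(600809, 20061231), (600809, 20070331), (600809, 20070630), (600809, 20070331)],
    names=["STK_ID", "RPT_Date"],
)
m = pd.Series(["demo_str"] * 4, index=mi, name="STK_Name")
m_dedup = m[~m.index.duplicated()]

print(first.tolist())  # [0, 1, 4, 5, 6, 9]
print(last.tolist())   # [0, 3, 4, 5, 8, 9]
```

Unlike the groupby approach, boolean masking keeps the rows in their original order.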