Given a DataFrame, I want to find the duplicated indexes whose rows do not have identical values in the columns, and see which values differ.
Specifically, I have this dataframe:
    import pandas as pd

    # data file fetched beforehand with:
    #   wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
    alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed',
                                          header=None, index_col=3)

    In [74]: alt_exon_repeatmasker.index.is_unique
    Out[74]: False
And some of the indexes are duplicated, with different values in the 9th column (the type of DNA repetitive element at this location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).
I'm guessing this will require some kind of groupby, and hopefully some groupby ninja can help me out.
To simplify even further, if we only have the index and the repeat type, the data look like this:

    genome_location1    MIR3
    genome_location1    AluJb
    genome_location2    Tigger1
    genome_location3    AT_rich
So in the output I'd like to see all duplicated indexes and their repeat types, like this:

    genome_location1    MIR3
    genome_location1    AluJb
EDIT: added toy example
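For convenience, here's a minimal runnable version of that toy example (the column name 'type' is just a label I picked; it isn't in the original file):

    import pandas as pd

    # toy frame: 'genome_location1' is a duplicated index with two different repeat types
    df = pd.DataFrame(
        {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
        index=['genome_location1', 'genome_location1',
               'genome_location2', 'genome_location3'])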
The duplicated() function indicates duplicate index values: duplicates are marked as True in the resulting boolean array, and you can mark either all duplicates, all except the first occurrence, or all except the last. Similarly, the pandas.DataFrame.duplicated() method finds duplicate rows in a DataFrame, returning a boolean Series that identifies whether each row is a duplicate or unique.
Also useful and very succinct:

    df[df.index.duplicated()]

Note that by default this omits the first occurrence of each duplicated index, so to see all the duplicated rows you'll want this:

    df[df.index.duplicated(keep=False)]
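On the toy data above (a sketch, assuming the frame is called df and the repeat-type column is named 'type'):

    import pandas as pd

    df = pd.DataFrame(
        {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
        index=['genome_location1', 'genome_location1',
               'genome_location2', 'genome_location3'])

    # default keep='first' marks every occurrence except the first
    df[df.index.duplicated()]
    #                    type
    # genome_location1  AluJb

    # keep=False marks every occurrence, which is the desired output
    df[df.index.duplicated(keep=False)]
    #                    type
    # genome_location1   MIR3
    # genome_location1  AluJb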
    df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster and a little more readable too.
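Here's how that looks on the toy data (a sketch; df and the 'type' column name are assumptions matching the example in the question):

    import pandas as pd

    df = pd.DataFrame(
        {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
        index=['genome_location1', 'genome_location1',
               'genome_location2', 'genome_location3'])

    # keep only the groups (index values) that occur more than once
    df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
    # genome_location1     MIR3
    # genome_location1    AluJb
    # Name: type, dtype: object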
Important: The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are at the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try it on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; they were just poorly implemented at first.