
Pandas: Get duplicated indexes

Given a dataframe, I want to get the duplicated indexes, which do not have duplicate values in the columns, and see which values are different.

Specifically, I have this dataframe:

wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed

import pandas as pd
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

And some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at this location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.

To simplify even further, if we only have the index and the repeat type,

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

So in the output I'd like to see all duplicated indexes and their repeat types, like so:

genome_location1    MIR3
genome_location1    AluJb
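The toy example above can be reproduced as a small DataFrame; the column name "type" is an assumption for illustration, since the toy data shows no header:

```python
import pandas as pd

# Toy example: index = genome location, one column with the repeat type.
# The column name "type" is assumed here for illustration.
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# The index is not unique: genome_location1 appears twice
print(df.index.is_unique)
```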

EDIT: added toy example

asked Nov 25 '13 by Olga Botvinnik



2 Answers

Also useful and very succinct:

df[df.index.duplicated()]

Note that this omits the first occurrence of each duplicated index (the default is keep='first'), so to see all the duplicated rows you'll want this:

df[df.index.duplicated(keep=False)]
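On the toy data from the question (with an assumed column name "type"), keep=False marks every row whose index appears more than once, which is exactly the desired output:

```python
import pandas as pd

# Toy data; the column name "type" is assumed for illustration
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# keep=False flags every occurrence of a duplicated index value,
# not just the repeats after the first
dupes = df[df.index.duplicated(keep=False)]
print(dupes)
```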

answered Sep 20 '22 by Matthew


df.groupby(level=0).filter(lambda x: len(x) > 1)['type'] 

We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.
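A sketch of the transform workaround mentioned above, using the toy data and an assumed column name "type": transform broadcasts each group's size back onto every row, so the resulting boolean mask aligns with the original (nonunique) index:

```python
import pandas as pd

# Toy data; the column name "type" is assumed for illustration
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# transform("size") returns a Series the same length as df,
# holding each row's group size; rows in groups of size > 1
# are the duplicated indexes
mask = df.groupby(level=0)["type"].transform("size") > 1
print(df[mask])
```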

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.

answered Sep 21 '22 by Dan Allan