Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to drop duplicated columns data based on column name in pandas

Tags:

pandas

Assume I have a table like below

    A   B   C   B
0   0   1   2   3
1   4   5   6   7

I'd like to drop column B. I tried to use drop_duplicates, but it seems that it only works based on duplicated data not header. Hope anyone know how to do this.

like image 871
X.Z Avatar asked Jun 15 '17 07:06

X.Z


People also ask

How do you drop duplicates in Pandas with conditions?

By using pandas. DataFrame. drop_duplicates() method you can drop/remove/delete duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.

How do you find duplicates in a DataFrame column?

Find duplicate columns in a DataFrame To find these duplicate columns we need to iterate over DataFrame column wise and for every column it will search if any other column exists in DataFrame with same contents. If yes then then that column name will be stored in duplicate column list.

Can you have duplicate columns in Pandas?

Pandas, however, can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They're also a problem because it will cause unanticipated and sometimes difficult to debug problems in Python.


2 Answers

Use Index.duplicated with loc or iloc and boolean indexing:

print (~df.columns.duplicated())
[ True  True  True False]

df = df.loc[:, ~df.columns.duplicated()]
print (df)
   A  B  C
0  0  1  2
1  4  5  6

df = df.iloc[:, ~df.columns.duplicated()]
print (df)
   A  B  C
0  0  1  2
1  4  5  6

Timings:

np.random.seed(123)
cols = ['A','B','C','B']
#[1000 rows x 30 columns]
df = pd.DataFrame(np.random.randint(10, size=(1000,30)),columns = np.random.choice(cols, 30))
print (df)

In [115]: %timeit (df.groupby(level=0, axis=1).first())
1000 loops, best of 3: 1.48 ms per loop

In [116]: %timeit (df.groupby(level=0, axis=1).mean())
1000 loops, best of 3: 1.58 ms per loop

In [117]: %timeit (df.iloc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 338 µs per loop

In [118]: %timeit (df.loc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 346 µs per loop

enter image description here

enter image description here

like image 153
jezrael Avatar answered Oct 19 '22 12:10

jezrael


You can groupby
We use the axis=1 and level=0 parameters to specify that we are grouping by columns. Then use the first method to grab the first column within each group defined by unique column names.

df.groupby(level=0, axis=1).first()

   A  B  C
0  0  1  2
1  4  5  6

We could have also used last

df.groupby(level=0, axis=1).last()

   A  B  C
0  0  3  2
1  4  7  6

Or mean

df.groupby(level=0, axis=1).mean()

   A  B  C
0  0  2  2
1  4  6  6
like image 5
piRSquared Avatar answered Oct 19 '22 10:10

piRSquared