Assume I have a table like below
A B C B
0 0 1 2 3
1 4 5 6 7
I'd like to drop column B. I tried to use drop_duplicates
, but it seems that it only works based on duplicated data not header.
Hope anyone know how to do this.
By using pandas. DataFrame. drop_duplicates() method you can drop/remove/delete duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.
Find duplicate columns in a DataFrame To find these duplicate columns we need to iterate over DataFrame column wise and for every column it will search if any other column exists in DataFrame with same contents. If yes then then that column name will be stored in duplicate column list.
Pandas, however, can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They're also a problem because it will cause unanticipated and sometimes difficult to debug problems in Python.
Use Index.duplicated
with loc
or iloc
and boolean indexing
:
print (~df.columns.duplicated())
[ True True True False]
df = df.loc[:, ~df.columns.duplicated()]
print (df)
A B C
0 0 1 2
1 4 5 6
df = df.iloc[:, ~df.columns.duplicated()]
print (df)
A B C
0 0 1 2
1 4 5 6
Timings:
np.random.seed(123)
cols = ['A','B','C','B']
#[1000 rows x 30 columns]
df = pd.DataFrame(np.random.randint(10, size=(1000,30)),columns = np.random.choice(cols, 30))
print (df)
In [115]: %timeit (df.groupby(level=0, axis=1).first())
1000 loops, best of 3: 1.48 ms per loop
In [116]: %timeit (df.groupby(level=0, axis=1).mean())
1000 loops, best of 3: 1.58 ms per loop
In [117]: %timeit (df.iloc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 338 µs per loop
In [118]: %timeit (df.loc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 346 µs per loop
You can groupby
We use the axis=1
and level=0
parameters to specify that we are grouping by columns. Then use the first
method to grab the first column within each group defined by unique column names.
df.groupby(level=0, axis=1).first()
A B C
0 0 1 2
1 4 5 6
We could have also used last
df.groupby(level=0, axis=1).last()
A B C
0 0 3 2
1 4 7 6
Or mean
df.groupby(level=0, axis=1).mean()
A B C
0 0 2 2
1 4 6 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With