How to drop duplicated columns based on column name in pandas

Tags:

pandas

Assume I have a table like below

    A   B   C   B
0   0   1   2   3
1   4   5   6   7

I'd like to drop the duplicated column B. I tried to use drop_duplicates, but it seems to work only on duplicated data, not on duplicated headers. Does anyone know how to do this?

871

asked Jun 15 '17 07:06

X.Z

2 Answers

Use Index.duplicated with loc or iloc and boolean indexing:

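# Boolean mask: False marks a column whose name already appeared further left
print(~df.columns.duplicated())
[ True  True  True False]

# Label-based selection with the mask
df = df.loc[:, ~df.columns.duplicated()]
print(df)
   A  B  C
0  0  1  2
1  4  5  6

# Position-based selection accepts the same boolean mask
df = df.iloc[:, ~df.columns.duplicated()]
print(df)
   A  B  C
0  0  1  2
1  4  5  6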
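
To reproduce this end to end, here is a minimal sketch that builds the example frame from the question and applies the same mask. The `keep="last"` variant is an addition that is not part of the answer above; it relies on the `keep` parameter of `Index.duplicated` to retain the last occurrence of each name instead of the first.

```python
import pandas as pd

# Example frame from the question: two columns share the name "B".
df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=list("ABCB"))

# True for every column whose name already appeared further left.
mask = df.columns.duplicated()

# Keep the first occurrence of each column name.
first_kept = df.loc[:, ~mask]

# Keep the last occurrence instead (added variant, not in the answer above).
last_kept = df.loc[:, ~df.columns.duplicated(keep="last")]

print(first_kept)
print(last_kept)
```

With `keep="last"` the surviving B stays at its original rightmost position, so the column order becomes A, C, B rather than A, B, C.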

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
cols = ['A','B','C','B']
# 1000 rows x 30 columns, with column names drawn at random from cols
df = pd.DataFrame(np.random.randint(10, size=(1000,30)), columns=np.random.choice(cols, 30))
print(df)

In [115]: %timeit (df.groupby(level=0, axis=1).first())
1000 loops, best of 3: 1.48 ms per loop

In [116]: %timeit (df.groupby(level=0, axis=1).mean())
1000 loops, best of 3: 1.58 ms per loop

In [117]: %timeit (df.iloc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 338 µs per loop

In [118]: %timeit (df.loc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 346 µs per loop
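
The `%timeit` lines above need IPython. A rough sketch of the same comparison as a plain script, using the standard-library `timeit` module, might look like this. One substitution to be aware of: the groupby candidate here is a transposed groupby rather than `groupby(level=0, axis=1)`, because the `axis=1` argument is deprecated since pandas 2.1, so its number is not directly comparable to the figures above. The `candidates` and `timings` names are only illustrative.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
cols = ["A", "B", "C", "B"]
# 1000 rows x 30 columns, with column names drawn at random from cols
df = pd.DataFrame(
    np.random.randint(10, size=(1000, 30)),
    columns=np.random.choice(cols, 30),
)

candidates = {
    "loc + duplicated": lambda: df.loc[:, ~df.columns.duplicated()],
    "iloc + duplicated": lambda: df.iloc[:, ~df.columns.duplicated()],
    # Assumption: stands in for groupby(level=0, axis=1).first()
    "transposed groupby first": lambda: df.T.groupby(level=0).first().T,
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, converted to microseconds per call.
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[name] = best * 1e6
    print(f"{name}: {timings[name]:.0f} us per call")
```

Absolute numbers will depend on the machine and the pandas version, so only the relative ordering is worth reading from the output.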

153

answered Oct 19 '22 12:10

jezrael

You can use groupby.
Passing axis=1 and level=0 tells pandas to group by column name. The first method then takes the first column within each group of identically named columns.

df.groupby(level=0, axis=1).first()

   A  B  C
0  0  1  2
1  4  5  6

We could also have used last:

df.groupby(level=0, axis=1).last()

   A  B  C
0  0  3  2
1  4  7  6

Or mean, which averages the same-named columns:

df.groupby(level=0, axis=1).mean()

   A  B  C
0  0  2  2
1  4  6  6
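
The `axis=1` argument to `groupby` is deprecated since pandas 2.1, and the documentation suggests grouping on the transposed frame instead. A minimal sketch of that alternative, assuming the example frame from the question; this is a substitute for the calls above, not what this answer originally used.

```python
import pandas as pd

# Example frame from the question: two columns share the name "B".
df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=list("ABCB"))

# Transpose so the column names become the index, then group rows by name.
grouped = df.T.groupby(level=0)

# Aggregate within each name and transpose back to the original orientation.
first = grouped.first().T  # first non-null value among same-named columns
last = grouped.last().T    # last non-null value among same-named columns
mean = grouped.mean().T    # element-wise mean of same-named columns

print(first)
print(last)
print(mean)
```

A few caveats: groupby sorts the group keys, so the columns come back in sorted order rather than their original order; `first` and `last` skip missing values, so with NaNs present a row can end up mixing values from different same-named columns; and transposing a frame with mixed dtypes can turn everything into `object` dtype.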

answered Oct 19 '22 10:10

piRSquared
