Say I have a data table
1 2 3 4 5 6 .. n A x x x x x x .. x B x x x x x x .. x C x x x x x x .. x
And I want to slim it down so that I only have, say, columns 3 and 5 deleting all other and maintaining the structure. How could I do this with pandas? I think I understand how to delete a single column, but I don't know how to save a select few and delete all others.
In Pandas, we can select a single column with just using the index operator [], but without list as argument. However, the resulting object is a Pandas series instead of Pandas Dataframe. For example, if we use df['A'], we would have selected the single column as Pandas Series object.
You can create a new DataFrame of a specific column by using DataFrame. assign() method. The assign() method assign new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones.
Select All Except One Column Using drop() Method in pandas In order to remove columns use axis=1 or columns param. For example df. drop("Discount",axis=1) removes Discount column by kepping all other columns untouched. This gives you a DataFrame with all columns with out one unwanted column.
Selecting columns based on their name This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series. Passing a list in the brackets lets you select multiple columns at the same time.
If you have a list of columns you can just select those:
In [11]: df Out[11]: 1 2 3 4 5 6 A x x x x x x B x x x x x x C x x x x x x In [12]: col_list = [3, 5] In [13]: df = df[col_list] In [14]: df Out[14]: 3 5 A x x B x x C x x
How do I keep certain columns in a pandas DataFrame, deleting everything else?
The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.
DataFrame.loc
One simple option is selection, as mentioned by in other answers,
# Setup. df 1 2 3 4 5 6 A x x x x x x B x x x x x x C x x x x x x cols_to_keep = [3,5]
df[cols_to_keep] 3 5 A x x B x x C x x
Or,
df.loc[:, cols_to_keep] 3 5 A x x B x x C x x
DataFrame.reindex
with axis=1
or 'columns'
(0.21+)However, we also have reindex
, in recent versions you specify axis=1
to drop:
df.reindex(cols_to_keep, axis=1) # df.reindex(cols_to_keep, axis='columns') # for versions < 0.21, use # df.reindex(columns=cols_to_keep) 3 5 A x x B x x C x x
On older versions, you can also use reindex_axis
: df.reindex_axis(cols_to_keep, axis=1)
.
DataFrame.drop
Another alternative is to use drop
to select columns by pd.Index.difference
:
# df.drop(cols_to_drop, axis=1) df.drop(df.columns.difference(cols_to_keep), axis=1) 3 5 A x x B x x C x x
The methods are roughly the same in terms of performance; reindex
is faster for smaller N, while drop
is faster for larger N. The performance is relative as the Y-axis is logarithmic.
Setup and Code
import pandas as pd import perfplot def make_sample(n): np.random.seed(0) df = pd.DataFrame(np.full((n, n), 'x')) cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False) return df, cols_to_keep perfplot.show( setup=lambda n: make_sample(n), kernels=[ lambda inp: inp[0][inp[1]], lambda inp: inp[0].loc[:, inp[1]], lambda inp: inp[0].reindex(inp[1], axis=1), lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1) ], labels=['__getitem__', 'loc', 'reindex', 'drop'], n_range=[2**k for k in range(2, 13)], xlabel='N', logy=True, equality_check=lambda x, y: (x.reindex_like(y) == y).values.all() )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With