
Keep certain columns in a pandas DataFrame, deleting everything else

Tags: python, pandas

Say I have a data table

        1  2  3  4  5  6 ..  n
    A   x  x  x  x  x  x ..  x
    B   x  x  x  x  x  x ..  x
    C   x  x  x  x  x  x ..  x

I want to slim it down so that I only have, say, columns 3 and 5, deleting all the others while keeping the structure intact. How can I do this with pandas? I think I understand how to delete a single column, but I don't know how to keep a select few and delete all the rest.

Matt asked May 17 '13 19:05




2 Answers

If you have a list of columns you can just select those:

    In [11]: df
    Out[11]:
       1  2  3  4  5  6
    A  x  x  x  x  x  x
    B  x  x  x  x  x  x
    C  x  x  x  x  x  x

    In [12]: col_list = [3, 5]

    In [13]: df = df[col_list]

    In [14]: df
    Out[14]:
       3  5
    A  x  x
    B  x  x
    C  x  x
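For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch. The sample frame and the helper name `keep_columns` are illustrative assumptions, not part of the answer above, and the `.copy()` call is an optional addition that makes the result independent of the original frame.

```python
import pandas as pd

# Illustrative frame shaped like the one in the question:
# integer column labels 1..6, row labels A..C.
df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))


def keep_columns(frame, col_list):
    """Hypothetical helper: return a new DataFrame with only col_list."""
    # Indexing with a list returns a DataFrame (not a Series).
    # .copy() detaches the result from the original frame.
    return frame[col_list].copy()


slim = keep_columns(df, [3, 5])
print(slim)
```

Returning a new frame rather than reassigning `df` leaves the original available if it is still needed.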
Andy Hayden answered Oct 06 '22 07:10


How do I keep certain columns in a pandas DataFrame, deleting everything else?

The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.

DataFrame.loc

One simple option is selection, as mentioned in other answers:

    # Setup.
    df

       1  2  3  4  5  6
    A  x  x  x  x  x  x
    B  x  x  x  x  x  x
    C  x  x  x  x  x  x

    cols_to_keep = [3,5]

    df[cols_to_keep]

       3  5
    A  x  x
    B  x  x
    C  x  x

Or,

    df.loc[:, cols_to_keep]

       3  5
    A  x  x
    B  x  x
    C  x  x
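If some of the wanted labels might not exist in the frame, one possible variation is to pass a boolean mask to `.loc`. This is only a sketch and not something the answer itself proposes; the label 99 is a made-up example of a missing column.

```python
import pandas as pd

df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))
cols_to_keep = [3, 5, 99]  # 99 is deliberately absent (illustrative)

# isin() builds a boolean mask over the columns that actually exist,
# so absent labels are simply skipped. In recent pandas versions,
# df[cols_to_keep] would raise KeyError here instead.
mask = df.columns.isin(cols_to_keep)
subset = df.loc[:, mask]
print(subset)
```

The mask keeps the surviving columns in their original order.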

DataFrame.reindex with axis=1 or 'columns' (0.21+)

We also have reindex. In recent versions you pass the labels to keep together with axis=1, and every other column is dropped:

    df.reindex(cols_to_keep, axis=1)
    # df.reindex(cols_to_keep, axis='columns')

    # for versions < 0.21, use
    # df.reindex(columns=cols_to_keep)

       3  5
    A  x  x
    B  x  x
    C  x  x

On older versions, you can also use reindex_axis: df.reindex_axis(cols_to_keep, axis=1). Note that reindex_axis was deprecated in 0.21 and removed in pandas 1.0.
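One behavior of reindex worth knowing before relying on it: a requested label that does not exist is not an error. The sketch below illustrates this, assuming the same sample frame as above; the label 99 is invented for the demonstration.

```python
import pandas as pd

df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))

# reindex returns the requested labels in the requested order.
kept = df.reindex(columns=[5, 3])

# Unlike df[...], a label that does not exist (99 here, illustrative)
# produces an all-NaN column rather than raising KeyError.
padded = df.reindex(columns=[3, 99])
print(padded)
```

Whether that padding is desirable depends on the use case; if a missing column should be treated as a mistake, plain `df[cols_to_keep]` fails loudly instead.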


DataFrame.drop

Another alternative is to use drop, computing the columns to discard with pd.Index.difference:

    # df.drop(cols_to_drop, axis=1)
    df.drop(df.columns.difference(cols_to_keep), axis=1)

       3  5
    A  x  x
    B  x  x
    C  x  x
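To make the intermediate step visible, here is a small sketch that splits the one-liner in two. The frame is the same illustrative sample, and `cols_to_keep` is deliberately given in reverse order to show how drop treats column order.

```python
import pandas as pd

df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))
cols_to_keep = [5, 3]  # reversed on purpose

# Index.difference returns the labels that are NOT in cols_to_keep.
cols_to_drop = df.columns.difference(cols_to_keep)
print(list(cols_to_drop))

# drop(columns=...) is equivalent to drop(..., axis=1).
result = df.drop(columns=cols_to_drop)
print(list(result.columns))
```

Because drop only removes columns, the survivors stay in their original order (3, then 5), whereas `df[cols_to_keep]` and `reindex` would return them in the order requested.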

Performance

[Figure: perfplot timings of __getitem__, loc, reindex, and drop against N, with a logarithmic Y-axis]

The methods are roughly the same in terms of performance; reindex is faster for smaller N, while drop is faster for larger N. Keep in mind that the Y-axis is logarithmic, so vertical distances in the plot represent ratios rather than absolute time differences.

Setup and Code

    import numpy as np
    import pandas as pd
    import perfplot

    def make_sample(n):
        # Build an n x n frame of 'x' and pick roughly a quarter of its columns.
        np.random.seed(0)
        df = pd.DataFrame(np.full((n, n), 'x'))
        cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False)

        return df, cols_to_keep


    perfplot.show(
        setup=lambda n: make_sample(n),
        # Each kernel receives the (df, cols_to_keep) tuple returned by setup.
        kernels=[
            lambda inp: inp[0][inp[1]],
            lambda inp: inp[0].loc[:, inp[1]],
            lambda inp: inp[0].reindex(inp[1], axis=1),
            lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1)
        ],
        labels=['__getitem__', 'loc', 'reindex', 'drop'],
        n_range=[2**k for k in range(2, 13)],
        xlabel='N',
        logy=True,
        # drop keeps the original column order, so align columns before comparing.
        equality_check=lambda x, y: (x.reindex_like(y) == y).values.all()
    )
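If perfplot is not installed, a rough comparison can be sketched with the standard library's timeit instead. This is not the benchmark that produced the figure above; the function name `time_kernels`, the size 64, and the repeat counts are arbitrary choices for illustration, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_sample(n, seed=0):
    """Build an n x n frame of 'x' and pick about a quarter of its columns."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(np.full((n, n), "x"))
    keep = rng.choice(df.columns.to_numpy(), max(2, n // 4), replace=False)
    return df, keep


def time_kernels(n, number=10, repeat=5):
    """Return the best per-call time in seconds for each selection method."""
    df, keep = make_sample(n)
    kernels = {
        "__getitem__": lambda: df[keep],
        "loc": lambda: df.loc[:, keep],
        "reindex": lambda: df.reindex(keep, axis=1),
        "drop": lambda: df.drop(df.columns.difference(keep), axis=1),
    }
    return {
        name: min(timeit.repeat(fn, number=number, repeat=repeat)) / number
        for name, fn in kernels.items()
    }


timings = time_kernels(64)
for name, seconds in timings.items():
    print(f"{name:>12}: {seconds * 1e6:.1f} us")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; calling `time_kernels` for a range of sizes gives a crude version of the curve.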
cs95 answered Oct 06 '22 05:10