Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I delete a column that contains only zeros in Pandas?

Tags:

python

pandas

I currently have a dataframe consisting of columns with 1's and 0's as values, I would like to iterate through the columns and delete the ones that are made up of only 0's. Here's what I have tried so far:

ones = [] zeros = [] for year in years:     for i in range(0,599):         if year[str(i)].values.any() == 1:             ones.append(i)         if year[str(i)].values.all() == 0:             zeros.append(i)     for j in ones:         if j in zeros:             zeros.remove(j)     for q in zeros:         del year[str(q)] 

In which years is a list of dataframes for the various years I am analyzing, ones consists of columns with a one in them and zeros is a list of columns containing all zeros. Is there a better way to delete a column based on a condition? For some reason I have to check whether the ones columns are in the zeros list as well and remove them from the zeros list to obtain a list of all the zero columns.

like image 616
user2587593 Avatar asked Jan 16 '14 14:01

user2587593


People also ask

How do I delete zero rows in pandas?

By default axis = 0 meaning to remove rows. Use axis=1 or columns param to remove columns. By default, pandas return a copy DataFrame after deleting rows, use inpalce=True to remove from existing referring DataFrame.

How do I delete a categorical column in pandas?

The Pandas drop() function in Python is used to drop specified labels from rows and columns. Drop is a major function used in data science & Machine Learning to clean the dataset. Pandas Drop() function removes specified labels from rows or columns.


2 Answers

df.loc[:, (df != 0).any(axis=0)] 

Here is a break-down of how it works:

In [74]: import pandas as pd  In [75]: df = pd.DataFrame([[1,0,0,0], [0,0,1,0]])  In [76]: df Out[76]:     0  1  2  3 0  1  0  0  0 1  0  0  1  0  [2 rows x 4 columns] 

df != 0 creates a boolean DataFrame which is True where df is nonzero:

In [77]: df != 0 Out[77]:         0      1      2      3 0   True  False  False  False 1  False  False   True  False  [2 rows x 4 columns] 

(df != 0).any(axis=0) returns a boolean Series indicating which columns have nonzero entries. (The any operation aggregates values along the 0-axis -- i.e. along the rows -- into a single boolean value. Hence the result is one boolean value for each column.)

In [78]: (df != 0).any(axis=0) Out[78]:  0     True 1    False 2     True 3    False dtype: bool 

And df.loc can be used to select those columns:

In [79]: df.loc[:, (df != 0).any(axis=0)] Out[79]:     0  2 0  1  0 1  0  1  [2 rows x 2 columns] 

To "delete" the zero-columns, reassign df:

df = df.loc[:, (df != 0).any(axis=0)] 
like image 152
unutbu Avatar answered Oct 12 '22 12:10

unutbu


Here is an alternative way to use is

df.replace(0,np.nan).dropna(axis=1,how="all")

Compared with the solution of unutbu, this way is obviously slower:

%timeit df.loc[:, (df != 0).any(axis=0)] 652 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  %timeit df.replace(0,np.nan).dropna(axis=1,how="all") 1.75 ms ± 9.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
like image 26
Jeremy Z Avatar answered Oct 12 '22 13:10

Jeremy Z