
Pandas: Get all columns that have constant value

Tags:

python

pandas

I want to get the names of the columns that hold the same value in every row.

My data:

   A   B  C  D
0  1  hi  2  a
1  3  hi  2  b
2  4  hi  2  c

Desired output:

['B', 'C']

Code:

import pandas as pd

d = {'A': [1,3,4], 'B': ['hi','hi','hi'], 'C': [2,2,2], 'D': ['a','b','c']}
df = pd.DataFrame(data=d)

I've been playing around with df.columns and .any(), but can't figure out how to do this.

tbienias asked May 29 '18 10:05


2 Answers

Use the lesser-known pandas built-in nunique():

df.columns[df.nunique() <= 1]
Index(['B', 'C'], dtype='object')

Notes:

  • Use nunique(dropna=False) if you want NaN counted as a separate value.
  • It's the cleanest code, but not the fastest. (In general, though, code should prioritize clarity and readability.)
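To make the dropna note concrete, here is a minimal sketch. The frame below is hypothetical (columns 'E' and 'F' are not in the question's data) and is only meant to show how missing values change the result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'E' mixes one real value with a missing value,
# 'F' contains only missing values.
df = pd.DataFrame({
    'B': ['hi', 'hi', 'hi'],
    'E': [1.0, np.nan, 1.0],
    'F': [np.nan, np.nan, np.nan],
})

# Default (dropna=True): NaN is ignored, so 'E' counts 1 value and 'F' counts 0.
ignoring_nan = df.columns[df.nunique() <= 1].tolist()

# dropna=False: NaN counts as its own value, so 'E' counts 2 and 'F' counts 1.
counting_nan = df.columns[df.nunique(dropna=False) <= 1].tolist()

print(ignoring_nan)  # ['B', 'E', 'F']
print(counting_nan)  # ['B', 'F']
```

Which behaviour is right depends on whether a partly missing column should count as constant for your purposes.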
smci answered Oct 09 '22 12:10


Solution 1:

c = [c for c in df.columns if len(set(df[c])) == 1]
print (c)

['B', 'C']

Solution 2:

c = df.columns[df.eq(df.iloc[0]).all()].tolist()
print (c)
['B', 'C']

Explanation for Solution 2:

First compare all rows to the first row with DataFrame.eq...

print (df.eq(df.iloc[0]))
       A     B     C      D
0   True  True  True   True
1  False  True  True  False
2  False  True  True  False

... then check that every value in each column is True with DataFrame.all...

print (df.eq(df.iloc[0]).all())
A    False
B     True
C     True
D    False
dtype: bool

... finally, keep only the column names for which the result is True:

print (df.columns[df.eq(df.iloc[0]).all()])
Index(['B', 'C'], dtype='object')
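The three steps can be wrapped in a small helper. This is only a sketch: the name constant_columns and the choice to return an empty list for an empty frame are my assumptions, not part of the answer. It also shows one caveat of comparing with eq: NaN never equals NaN, so an all-NaN column is not reported.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Return names of columns whose values all equal the first row's value.

    Hypothetical helper wrapping Solution 2.
    """
    if df.empty:
        # df.iloc[0] raises IndexError when there are no rows, so bail out early.
        return []
    return df.columns[df.eq(df.iloc[0]).all()].tolist()

d = {'A': [1, 3, 4], 'B': ['hi', 'hi', 'hi'], 'C': [2, 2, 2], 'D': ['a', 'b', 'c']}
df = pd.DataFrame(data=d)
result = constant_columns(df)
print(result)  # ['B', 'C']

# Caveat: NaN == NaN is False, so an all-NaN column is not detected here.
nan_result = constant_columns(pd.DataFrame({'X': [np.nan, np.nan]}))
print(nan_result)  # []
```

If all-NaN columns should count as constant, nunique(dropna=False) <= 1 is probably the simpler route.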

Timings:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(1000,100)))

df[np.random.randint(100, size=20)] = 100
print (df)

# Solution 1 (second-fastest):
In [243]: %timeit ([c for c in df.columns if len(set(df[c])) == 1])
3.59 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Solution 2 (fastest):
In [244]: %timeit df.columns[df.eq(df.iloc[0]).all()].tolist()
1.62 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Mohamed Thasin ah solution
In [245]: %timeit ([col for col in df.columns if len(df[col].unique())==1])
6.8 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#jpp solution
In [246]: %%timeit
     ...: vals = df.apply(set, axis=0)
     ...: res = vals[vals.map(len) == 1].index
     ...: 
5.59 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 1
In [275]: %timeit df.columns[ df.nunique()==1 ]
11 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 2 (note: `not is_unique` flags any column containing a duplicate, not only constant columns)
In [276]: %timeit [col for col in df.columns if not df[col].is_unique]
9.25 ms ± 80 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 3 (same caveat: `not is_unique` is not a test for a constant column)
In [277]: %timeit df.columns[ df.apply(lambda col: not col.is_unique) ]
11.1 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
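The %timeit lines above need IPython. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module. The candidates dictionary and its labels are my own naming, the constant columns are set in a loop rather than with a single array assignment, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(1000, 100)))

# Overwrite up to 20 randomly chosen columns with a constant value.
for col in np.unique(np.random.randint(100, size=20)):
    df[int(col)] = 100

# Hypothetical labels for three of the approaches timed above.
candidates = {
    'set comprehension': lambda: [c for c in df.columns if len(set(df[c])) == 1],
    'eq + all': lambda: df.columns[df.eq(df.iloc[0]).all()].tolist(),
    'nunique': lambda: df.columns[df.nunique() == 1].tolist(),
}

# Check that every approach returns the same columns before timing them.
results = {name: fn() for name, fn in candidates.items()}
expected = results['eq + all']
assert all(r == expected for r in results.values())

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20) / 20
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Checking that the candidates agree first avoids timing an expression that answers a different question.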
jezrael answered Oct 09 '22 13:10