Check if all values in dataframe column are the same

Tags: python, python-3.x, pandas, dataframe

I want to do a quick and easy check if all column values for counts are the same in a dataframe:

In:

import pandas as pd

d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 3, 3]}
df = pd.DataFrame(data=d)
df

Out:

  names  counts
0   Jim       3
1   Ted       4
2   Mal       3
3   Ted       3

I want just a simple condition that if all counts = same value then print('True').

Is there a fast way to do this?

asked Jan 28 '19 15:01

HelloToEarth

3 Answers

An efficient way to do this is by comparing the first value with the rest, and using all:

def is_unique(s):
    a = s.to_numpy() # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(df['counts'])
# False

Although the most intuitive idea might be to count the number of unique values and check whether there is only one, that does more work than this check needs. NumPy's np.unique sorts the underlying array, which has an average complexity of O(n·log(n)) with the default quicksort. pandas' nunique is hash-based rather than sort-based, but it still has to collect every distinct value before it can answer. The above approach is a single O(n) comparison pass.

The difference in performance becomes more obvious when we're applying this to an entire dataframe (see below).
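To tie this back to the condition asked for in the question, here is a minimal sketch of the same first-vs-rest check. The name all_same and the choice to treat an empty column as "all the same" are my own assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

def all_same(s: pd.Series) -> bool:
    """Return True if every value equals the first one (hypothetical helper)."""
    a = s.to_numpy()
    if a.size == 0:
        # Assumption: an empty column counts as "all the same".
        return True
    return bool((a[0] == a).all())

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})

# The condition from the question: print 'True' only if all counts match.
if all_same(df['counts']):
    print('True')
else:
    print('False')
```

One caveat worth knowing: NaN never compares equal to itself, so a column made up entirely of NaN is reported as not all the same by this check.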

For an entire dataframe

To run the same check on every column of a dataframe, we can extend the above by passing axis=0 to all:

def unique_cols(df):
    a = df.to_numpy() # df.values (pandas<0.24)
    return (a[0] == a).all(0)

For the shared example, we'd get:

unique_cols(df)
# array([False, False])
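As a small illustration of how the boolean array could be used, the sketch below picks out the names of the constant columns. The frame df2 and the variable names are made up for the example; unique_cols is repeated so the snippet runs on its own.

```python
import pandas as pd

def unique_cols(df):
    a = df.to_numpy()
    # Compare the first row against every row, then reduce down each column.
    return (a[0] == a).all(0)

# Hypothetical variant of the example frame where 'counts' is constant.
df2 = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                    'counts': [3, 3, 3, 3]})

mask = unique_cols(df2)
constant_columns = df2.columns[mask].tolist()
print(constant_columns)
```

Note that a frame with mixed dtypes is converted to a single object array by to_numpy, which is likely slower than the purely numeric case.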

Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):

import numpy as np
import pandas as pd
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)],
    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],
    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)

[Figure: perfplot timings of nunique and first_vs_rest against N. Image: https://i.stack.imgur.com/76i7y.png]

And below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here since it can short-circuit as soon as it sees a value that differs from the first one in a given column (note: the numba approach will only work with numerical data):

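import numpy as np
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    # 1 marks a column whose values are all equal, 0 otherwise
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break  # first differing value: stop scanning this column
        else:
            # no break happened, so every value matched the first one
            out[i] = 1
    return out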
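If numba isn't installed, the early-exit logic can still be sanity-checked with a plain Python version of the same loop. This is only an illustrative sketch (unique_cols_py is a made-up name, and the loop is far too slow for real use); it compares its result with the vectorized first-row comparison.

```python
import numpy as np

def unique_cols_py(a):
    """Pure-Python version of the early-exit loop (illustrative only)."""
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=bool)
    for i in range(n_cols):
        init = a[0, i]
        for v in a[1:, i]:
            if v != init:
                break  # a differing value settles this column
        else:
            out[i] = True  # no break: the whole column matched
    return out

rng = np.random.default_rng(0)
# Five random integer columns followed by two all-zero columns.
a = np.concatenate([rng.integers(0, 1_000, (1_000, 5)),
                    np.zeros((1_000, 2))], axis=1)

vectorized = (a[0] == a).all(0)
looped = unique_cols_py(a)
print(looped)
```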

If we compare the three methods:

df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)),
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n), :],
    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df)
    ],
    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)

[Figure: perfplot timings of nunique, unique_cols_nb and unique_cols against N. Image: https://i.stack.imgur.com/ofIuA.png]

answered Oct 25 '22 18:10

yatu

Update using np.unique

len(np.unique(df.counts)) == 1
# False

Or

len(set(df.counts.tolist())) == 1

Or

df.counts.eq(df.counts.iloc[0]).all()
# False

Or

df.counts.std() == 0
# False
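To see how these four checks behave side by side, here is a rough sketch that runs them on a few small Series. The helper same_value_checks and the test data are my own additions for illustration.

```python
import numpy as np
import pandas as pd

def same_value_checks(s):
    """Run the four checks from this answer on one Series (hypothetical helper)."""
    return {
        'np.unique': len(np.unique(s)) == 1,
        'set': len(set(s.tolist())) == 1,
        'eq_first': bool(s.eq(s.iloc[0]).all()),
        'std': bool(s.std() == 0),
    }

mixed = same_value_checks(pd.Series([3, 4, 3, 3]))     # the question's counts
constant = same_value_checks(pd.Series([3, 3, 3, 3]))  # all equal
single = same_value_checks(pd.Series([3]))             # one row only

print(mixed)
print(constant)
print(single)
```

The checks agree on the first two Series. On a single-row Series the std variant disagrees with the others, because the sample standard deviation of one value is NaN rather than 0. The std variant also needs numeric data.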

answered Oct 25 '22 19:10

BENY

I prefer:

df['counts'].eq(df['counts'].iloc[0]).all()

I find it the easiest to read and it works across all value types. I have also found it fast enough in my experience.
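As a quick sketch of the "all value types" point, the same expression can be tried on a string column and a datetime column. The frame below and the helper name same_as_first are invented for the example.

```python
import pandas as pd

# Hypothetical frame with a string column and a datetime column.
df = pd.DataFrame({
    'names': ['Ted', 'Ted', 'Ted'],
    'joined': pd.to_datetime(['2019-01-28', '2019-01-28', '2019-01-29']),
})

def same_as_first(col):
    # Same expression as above, wrapped so it returns a plain Python bool.
    return bool(col.eq(col.iloc[0]).all())

names_same = same_as_first(df['names'])
joined_same = same_as_first(df['joined'])

if names_same:
    print('True')
```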