Pandas, Get count of a single value in a Column of a Dataframe

Tags:

python

pandas

Using pandas, I would like to get the count of a specific value in a column. I know that using df.somecolumn.ravel() will give me all the unique values and their counts. But how do I get the count of one specific value?

  In[5]:df
  Out[5]:
     col
       1
       1
       1
       1
       2
       2
       2
       1

Desired:

  To get the count of 1:

  In[6]:df.somecalulation(1)
  Out[6]: 5

  To get the count of 2:

  In[6]:df.somecalulation(2)
  Out[6]: 3

274

asked Mar 17 '16 17:03

Randhawa

2 Answers

You can try value_counts, which returns the count of every unique value in the column:

counts = df['col'].value_counts().reset_index()
counts.columns = ['col', 'count']
print(counts)
   col  count
0    1      5
1    2      3

EDIT:

To count a single value, compare the column with that value and sum the resulting boolean mask:

print((df['col'] == 1).sum())
5

Or:

def somecalulation(x):
    # True counts as 1 when summed, so this is the number of matching rows
    return (df['col'] == x).sum()

print(somecalulation(1))
5
print(somecalulation(2))
3
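The comparison-and-sum approach above can be run end to end against the data from the question. This is a minimal sketch: the frame is rebuilt from the values shown in the question, and count_equal is a hypothetical helper name, not a pandas method.

```python
import pandas as pd

# Sample frame rebuilt from the question: five 1s and three 2s.
df = pd.DataFrame({"col": [1, 1, 1, 1, 2, 2, 2, 1]})


def count_equal(frame, column, value):
    """Count rows where `column` equals `value` (hypothetical helper)."""
    mask = frame[column] == value  # boolean Series, True where the value matches
    return int(mask.sum())         # True is summed as 1, False as 0


print(count_equal(df, "col", 1))
print(count_equal(df, "col", 2))
```

Passing the frame and column name as arguments avoids relying on a global df, which the shorter versions above do.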

Or:

ser = df['col'].value_counts()

def somecalulation(s, x):
    # s is indexed by the unique values, so s[x] is the count of x
    return s[x]

print(somecalulation(ser, 1))
5
print(somecalulation(ser, 2))
3
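One thing to watch with the lookup version: indexing the value_counts result with a value that never occurs in the column raises a KeyError. A possible workaround, sketched here with a hypothetical helper named safe_count, is Series.get with a default of 0.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({"col": [1, 1, 1, 1, 2, 2, 2, 1]})
ser = df["col"].value_counts()


def safe_count(counts, value):
    """Look up a count, returning 0 for values that are absent (hypothetical helper)."""
    # Series.get returns the default instead of raising KeyError.
    return int(counts.get(value, 0))


print(safe_count(ser, 1))  # value present in the column
print(safe_count(ser, 7))  # value absent from the column
```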

EDIT2:

If you need something really fast, use numpy.in1d (newer NumPy releases recommend the equivalent numpy.isin instead):

import pandas as pd
import numpy as np

a = pd.Series([1, 1, 1, 1, 2, 2])

# for testing, repeat the data so that len(a) == 6000
a = pd.concat([a]*1000).reset_index(drop=True)

print(np.in1d(a, 1).sum())
4000
print((a == 1).sum())
4000
print(np.sum(a == 1))
4000

Timings:

len(a)=6:

In [131]: %timeit np.in1d(a,1).sum()
The slowest run took 9.17 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 29.9 µs per loop

In [132]: %timeit np.sum(a == 1)
10000 loops, best of 3: 196 µs per loop

In [133]: %timeit (a == 1).sum()
1000 loops, best of 3: 180 µs per loop

len(a)=6000:

In [135]: %timeit np.in1d(a,1).sum()
The slowest run took 7.29 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 48.5 µs per loop

In [136]: %timeit np.sum(a == 1)
The slowest run took 5.23 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 273 µs per loop

In [137]: %timeit (a == 1).sum()
1000 loops, best of 3: 271 µs per loop
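The timings above come from one machine and an older pandas/NumPy combination, so it may be worth re-measuring locally. The sketch below uses the standard timeit module instead of the IPython %timeit magic. The set of candidates is my own choice: it uses np.isin in place of np.in1d and adds np.count_nonzero on the raw array, neither of which appears in the original measurements.

```python
import timeit

import numpy as np
import pandas as pd

# Same test data as above: 6000 elements, 4000 of which equal 1.
a = pd.concat([pd.Series([1, 1, 1, 1, 2, 2])] * 1000).reset_index(drop=True)
values = a.to_numpy()  # plain NumPy array, bypasses pandas overhead

# Candidate expressions to compare (labels are only for display).
candidates = {
    "np.isin(values, 1).sum()": lambda: np.isin(values, 1).sum(),
    "np.count_nonzero(values == 1)": lambda: np.count_nonzero(values == 1),
    "(a == 1).sum()": lambda: (a == 1).sum(),
}

results = {}
for label, func in candidates.items():
    results[label] = int(func())                 # record the count itself
    seconds = timeit.timeit(func, number=1000)   # total time for 1000 calls
    print(f"{label}: {seconds / 1000 * 1e6:.1f} µs per call")
```

All candidates return the same count, so the choice between them is purely a matter of speed on your own setup.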

139

answered Oct 13 '22 01:10

jezrael

If you keep the result of value_counts, you can query it for multiple values:

>>> import pandas as pd
>>> a = pd.Series([1, 1, 1, 1, 2, 2])
>>> counts = a.value_counts()
>>> counts[1], counts[2]
(4, 2)

However, to count only a single item, it would be faster to use:

import numpy as np
np.sum(a == 1)
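When several values are needed at once, including some that may not occur in the data, the value_counts result can be reindexed instead of being queried key by key. This is a sketch under my own assumptions: the list wanted and the fill value of 0 are illustrative choices, not part of the answer above.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2, 2])
counts = a.value_counts()

# Values to look up; 3 does not occur in the data (illustrative choice).
wanted = [1, 2, 3]

# reindex keeps the requested order and fills absent values with 0.
per_value = counts.reindex(wanted, fill_value=0)
print(per_value.to_dict())

# Total number of elements equal to any of the listed values.
total = int(a.isin([1, 2]).sum())
print(total)
```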

answered Oct 13 '22 01:10

Ami Tavory
