I am trying to find the number of times a certain value appears in one column.
I created the DataFrame with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to count how often a given value appears in a column. How is this done?
I thought it would be the code below, where I look in the education column and count the number of times ? occurs.
The code below tries to count how many times 9th appears, and the error is what I get when I run it.
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
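A KeyError here means the label '9th' is not in the index of the Series returned by value_counts(). Either the value never occurs, or it is stored slightly differently (stray whitespace is one possible cause, which is only an assumption about this data set). A minimal sketch with made-up sample data, using Series.get to fall back to 0 instead of raising:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's CSV;
# note the trailing space in the second value.
df = pd.DataFrame({"education": ["9th", "9th ", "8th", "?"]})

counts = df["education"].value_counts()

# .get() returns the default (0) when the label is absent,
# where counts["10th"] would raise KeyError.
missing = counts.get("10th", 0)

# "9th " and "9th" are different labels; stripping whitespace
# before counting merges them.
clean_counts = df["education"].str.strip().value_counts()
ninth = clean_counts.get("9th", 0)

print(missing, ninth)
```

Printing df.education.unique() is a quick way to check how the values are actually stored.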
Use the sum() function to count specific values in a DataFrame column. Calling sum() on a boolean comparison of a column counts the rows that satisfy the condition; here == selects only the rows equal to the target value. Any other boolean condition can be counted the same way.
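As a rough illustration of the sum() approach, the sketch below reuses the three-row sample frame from this page; the variable names are only for illustration. Each comparison yields a boolean Series, and sum() counts its True values:

```python
import pandas as pd

# Small sample frame (same shape as the example used on this page).
df = pd.DataFrame({"col1": ["a", "b", "c"],
                   "education": ["9th", "9th", "8th"]})

# True counts as 1 and False as 0, so sum() gives the number of matches.
n_9th = int((df["education"] == "9th").sum())

# Other boolean conditions work the same way.
n_not_9th = int((df["education"] != "9th").sum())
n_either = int(df["education"].isin(["8th", "9th"]).sum())

print(n_9th, n_not_9th, n_either)
```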
count() is a built-in method of Python lists and strings. For a list, you pass the element to count and it returns how many times that element occurs; for a string, it returns the number of non-overlapping occurrences of a substring. In both cases the result is an integer.
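For plain Python containers, a small sketch (the sample values are made up) looks like this:

```python
# list.count returns how many elements equal the argument.
grades = ["9th", "9th", "8th"]
n_in_list = grades.count("9th")

# An element that is absent simply gives 0, with no exception.
n_absent = grades.count("10th")

# str.count returns the number of non-overlapping substring matches.
text = "9th,9th,8th"
n_in_text = text.count("9th")

print(n_in_list, n_absent, n_in_text)
```

Note that this is different from the pandas count() method, which counts non-null entries.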
You can count duplicate rows by counting the True values in the pandas.Series returned by duplicated(). The number of True values can be obtained with the sum() method. To count the False values instead (the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
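A short sketch of the duplicated() approach, assuming a made-up frame in which the third row repeats the first:

```python
import pandas as pd

# Rows 0 and 2 are identical, so row 2 is flagged as a duplicate.
df = pd.DataFrame({"col1": ["a", "b", "a", "c"],
                   "education": ["9th", "9th", "9th", "8th"]})

# By default the first occurrence is kept and later repeats are marked True.
dup_mask = df.duplicated()

n_duplicates = int(dup_mask.sum())
n_non_duplicates = int((~dup_mask).sum())

print(n_duplicates, n_non_duplicates)
```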
You can create a subset of the data with your condition and then use shape or len:
print(df)
  col1 education
0    a       9th
1    b       9th
2    c       8th

print(df.education == '9th')
0     True
1     True
2    False
Name: education, dtype: bool

print(df[df.education == '9th'])
  col1 education
0    a       9th
1    b       9th

print(df[df.education == '9th'].shape[0])
2

print(len(df[df['education'] == '9th']))
2
Performance is interesting: the fastest solution is to compare against the underlying NumPy array and sum:
Code:
import string

import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)

# Each kernel counts the rows where education == 'a' in a different way.
def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

# Build a one-column frame of n random ASCII letters.
def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
A couple of ways using count or sum:
In [338]: df
Out[338]:
  col1 education
0    a       9th
1    b       9th
2    c       8th

In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2

In [336]: (df.education == '9th').sum()
Out[336]: 2

In [337]: df.query('education == "9th"').education.count()
Out[337]: 2