
Count occurrences of each of certain words in pandas dataframe

I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:

    a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
    n = a.apply(lambda x: 1).sum()

Is there a method that matches a regular expression and returns the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

Nilani Algiriyage asked Jul 10 '13 14:07


People also ask

How do you count occurrences in a DataFrame in Python?

Calling size() or count() on the result of pandas.DataFrame.groupby() gives the number of occurrences of each value in a particular column of the dataframe.
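A minimal sketch of the difference between the two, using a made-up frame (the column names `team` and `score` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: one row per observation, with one missing score.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1.0, None, 3.0, 4.0, 5.0],
})

# size() counts rows per group, including rows with missing values.
rows_per_team = df.groupby("team").size()

# count() counts only the non-missing values in the selected column.
scores_per_team = df.groupby("team")["score"].count()
```

Here size() reports every row in each group, while count() skips the missing score in team "a".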

How do you count specific values in a data frame?

Use the sum() function to count specific values in a column of a dataframe. Comparing the column to a value with == gives a boolean Series that is True only for the rows equal to that value, and summing it counts those rows. Values that satisfy any other boolean condition can be counted the same way.
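As a rough illustration, assuming a small made-up frame with hypothetical columns `color` and `qty`:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "qty": [5, 12, 7, 20],
})

# Comparing a column to a value gives a boolean Series; True sums as 1.
n_red = (df["color"] == "red").sum()

# Any other boolean condition can be counted the same way.
n_large = (df["qty"] > 10).sum()
```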

How do you count unique occurrences in Pandas?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.
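A short sketch on made-up data (the column names are hypothetical), showing nunique() on a single column and on the whole frame:

```python
import pandas as pd

# Hypothetical example data with one missing city.
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", None],
    "year": [2020, 2021, 2020, 2022],
})

# On a Series: number of distinct values; missing values are ignored by default.
unique_cities = df["city"].nunique()

# On a DataFrame: one distinct-value count per column.
unique_per_column = df.nunique()
```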

How do you calculate the number of characters in each word in a series Pandas?

To calculate the number of characters, use Series.str.len(). It returns the length of each string in the Series.
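For instance, on a small made-up Series:

```python
import pandas as pd

# Hypothetical example data.
words = pd.Series(["hello", "world", "hi"])

# One length per element, in the same order as the input.
lengths = words.str.len()
```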


2 Answers

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:

    In [12]: s
    Out[12]:
    0    ax
    1    ay
    2    bx
    3    by
    4    bz
    dtype: object

The match string method handles regular expressions...

    In [13]: s.str.match('(b[x-y]+)')
    Out[13]:
    0       []
    1       []
    2    (bx,)
    3    (by,)
    4       []
    dtype: object

...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

    In [14]: s.str.match('(b[x-y]+)').str.get(0)
    Out[14]:
    0    NaN
    1    NaN
    2     bx
    3     by
    4    NaN
    dtype: object

...which are not counted.

    In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
    Out[15]: 2
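The session above reflects the pandas version in use when this answer was written. In current pandas, str.match returns a boolean Series rather than the matched groups, so the same idea can be sketched with str.extract instead. This is a substitute technique, not the answerer's code; the leading `^` is added on the assumption that matching should start at the beginning of each string, as str.match does.

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# str.extract returns the first capture group per row, NaN where nothing matches.
# expand=False keeps the result a Series when there is a single group.
matched = s.str.extract(r"^(b[x-y]+)", expand=False)

# count() skips NaN, so this is the total number of matching rows.
total_matches = matched.count()

# value_counts() tallies each distinct matched word individually.
per_word = matched.value_counts()
```

Here value_counts() plays the role of the groupby(lambda x: x).count() step.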
Dan Allan answered Sep 18 '22 13:09


Update: Original answer counts those rows which contain a substring.

To count all the occurrences of a substring you can use .str.count:

    In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

    In [22]: df.words.str.count("he|wo")
    Out[22]:
    0    1
    1    1
    2    2
    Name: words, dtype: int64

    In [23]: df.words.str.count("he|wo").sum()
    Out[23]: 4
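To tally each of many words separately, as the question asks, one option is to loop over the word list and call .str.count once per word. The sketch below assumes a short hypothetical list, `targets`, standing in for the roughly 100 strings, and treats each one as a literal substring rather than a regular expression:

```python
import re
import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Hypothetical stand-in for the ~100 strings to look for.
targets = ["he", "wo", "zz"]

# re.escape keeps regex metacharacters in a target from being interpreted.
# Each entry is the total number of occurrences of that target in the column.
counts = pd.Series(
    {t: df["words"].str.count(re.escape(t)).sum() for t in targets}
)
```

This makes one pass over the column per word, which is simple but may be slow for very large frames.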

The str.contains method accepts a regular expression:

    Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
    Docstring:
    Check whether given pattern is contained in each string in the array

    Parameters
    ----------
    pat : string
        Character sequence or regular expression
    case : boolean, default True
        If True, case sensitive
    flags : int, default 0 (no flags)
        re module flags, e.g. re.IGNORECASE
    na : default NaN, fill value for missing values.

For example:

    In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

    In [12]: df
    Out[12]:
       words
    0  hello
    1  world

    In [13]: df.words.str.contains(r'[hw]')
    Out[13]:
    0    True
    1    True
    Name: words, dtype: bool

    In [14]: df.words.str.contains(r'he|wo')
    Out[14]:
    0    True
    1    True
    Name: words, dtype: bool

To count the occurrences (that is, the number of rows that match), you can just sum this boolean Series:

    In [15]: df.words.str.contains(r'he|wo').sum()
    Out[15]: 2

    In [16]: df.words.str.contains(r'he').sum()
    Out[16]: 1
Andy Hayden answered Sep 18 '22 13:09