I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:
a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()
Is there a method to match a regular expression and get the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.
Using the size() or count() method with pandas.DataFrame.groupby() gives the number of occurrences of each value in a particular column of the dataframe.
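As a rough illustration of the difference between the two, here is a minimal sketch. The data is made up, and the column names col1 and col2 are borrowed from the question only for familiarity.

```python
import pandas as pd

# Hypothetical data; column names mirror the question's col1/col2.
df2 = pd.DataFrame({
    "col1": ["sample a", "other", "sample b", "sample c"],
    "col2": ["x", "x", "y", "x"],
})

# size() counts the rows in each group of col2.
rows_per_group = df2.groupby("col2").size()

# count() reports the non-null values per remaining column in each group.
non_null_per_group = df2.groupby("col2").count()

print(rows_per_group)
print(non_null_per_group)
```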
Use the sum() function to count specific values in a column of a DataFrame. Applying sum() to a boolean condition on a column counts the rows that satisfy it; in this case we use == to select just the rows equal to a specific value. Any other boolean condition can be counted the same way.
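A short sketch of that idea, using an invented words column: each comparison yields a boolean Series, and summing it counts the True entries.

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({"words": ["hello", "world", "hello", "hehe"]})

# Count rows equal to one specific value.
equal_count = (df["words"] == "hello").sum()

# Any other boolean condition works the same way,
# e.g. strings longer than four characters.
long_count = (df["words"].str.len() > 4).sum()

print(equal_count, long_count)
```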
You can use the nunique() function to count the number of unique values in a pandas DataFrame.
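For instance, with a small invented frame, nunique() can be called on the whole DataFrame (one result per column) or on a single column:

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({"words": ["hello", "world", "hello"], "n": [1, 1, 2]})

# Distinct values per column, returned as a Series indexed by column name.
per_column = df.nunique()

# Distinct values in one column, returned as a plain integer.
words_unique = df["words"].nunique()

print(per_column)
print(words_unique)
```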
To calculate the number of characters, use Series.str.len(). This function returns the number of characters in each string of a Series.
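A minimal sketch with made-up strings:

```python
import pandas as pd

# Made-up example data.
s = pd.Series(["hello", "world", "hehe"])

# Length of each string, element by element.
lengths = s.str.len()

print(lengths.tolist())
```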
To count the total number of matches, use s.str.match(...).str.get(0).count().
If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()
It works like this:
In [12]: s
Out[12]:
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions...

In [13]: s.str.match('(b[x-y]+)')
Out[13]:
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]:
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object
...which are not counted.
In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
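Note that the session above reflects an older pandas release. In recent versions, str.match returns a boolean Series rather than the matched groups, so the .str.get(0) step may not work there. The following is a sketch of the same tally using str.extract, which is a substitution on my part and not the answer's original method. One difference to keep in mind: str.match anchors at the start of each string, whereas str.extract searches anywhere in it.

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# extract returns the first capture group, or NaN where nothing matches.
matched = s.str.extract(r"(b[x-y]+)", expand=False)

# count() skips the NaN entries, giving the total number of matches.
total_matches = matched.count()

# value_counts() tallies each distinct matched word individually.
per_word = matched.value_counts()

print(total_matches)
print(per_word)
```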
Update: The original answer counts the rows that contain a substring.
To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
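For the question's case of roughly 100 strings, one possible approach is to loop over the terms for per-term totals, or to join them into a single alternation for a grand total. This is only a sketch: the terms list is hypothetical, and re.escape is used on the assumption that the terms are literal strings, not regular expressions.

```python
import re

import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Hypothetical search terms; in practice this list could hold ~100 strings.
terms = ["he", "wo", "zz"]

# Occurrences of each term across the whole column, tallied individually.
per_term = pd.Series(
    {t: df["words"].str.count(re.escape(t)).sum() for t in terms}
)

# One combined pattern gives the grand total in a single pass.
pattern = "|".join(re.escape(t) for t in terms)
total = df["words"].str.count(pattern).sum()

print(per_term)
print(total)
```

If some terms overlap (one is a prefix of another), the combined pattern counts non-overlapping matches only, so the grand total can differ from the sum of the per-term totals.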
The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.
For example:
In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the occurrences you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
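The case, flags, and na parameters from the docstring can matter when the column has mixed case or missing values. A minimal sketch with invented data, passing na=False so that missing entries count as non-matches:

```python
import re

import pandas as pd

# Made-up data with mixed case and one missing value.
df = pd.DataFrame(["Hello", "world", None], columns=["words"])

# flags=re.IGNORECASE makes the regex case-insensitive;
# na=False fills missing values with False so the sum is well defined.
hits = df["words"].str.contains(r"he|wo", flags=re.IGNORECASE, na=False).sum()

# case=False is the keyword alternative to passing re.IGNORECASE.
hits_he = df["words"].str.contains(r"he", case=False, na=False).sum()

print(hits, hits_he)
```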