Count occurrences of each of certain words in pandas dataframe

I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Is there a method that matches a regular expression and returns the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

asked Jul 10 '13 14:07

Nilani Algiriyage

2 Answers

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:

In [12]: s
Out[12]:
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions...

In [13]: s.str.match('(b[x-y]+)')
Out[13]:
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]:
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object

...which are not counted.

In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
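The session above reflects the pandas version the answer was written against. In recent pandas releases, str.match returns booleans rather than the matched groups, so the same idea may need str.extract instead. A minimal sketch under that assumption (the ^ anchor is added here to mimic the start-of-string behaviour of str.match; the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# str.extract returns the first capture group, or NaN where the pattern
# does not match. The leading ^ anchors the search at the start of each string.
matches = s.str.extract(r"^(b[x-y]+)", expand=False)

# count() skips NaN, so this is the number of rows that matched.
total = matches.count()

# value_counts() tallies each distinct matched string separately.
per_word = matches.value_counts()
```

Here total plays the role of the .count() call above, and per_word replaces the groupby(lambda x: x).count() tally.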

answered Sep 18 '22 13:09

Dan Allan

Update: Original answer counts those rows which contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
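The combined pattern gives one total for all words. To get a separate count for each word, as the question asks, one option is to call str.count once per word. This is only a sketch: the targets list is a hypothetical stand-in for the roughly 100 strings in the question, and re.escape is used on the assumption that the strings are literal text rather than regular expressions.

```python
import re

import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Hypothetical search strings; replace with the real list of words.
targets = ["he", "wo", "x.y"]

# One str.count pass per word. re.escape keeps characters such as "."
# from being interpreted as regex syntax.
counts = pd.Series(
    {word: df["words"].str.count(re.escape(word)).sum() for word in targets}
)
```

The result is a Series indexed by word, so counts["he"] is the number of occurrences of "he" over the whole column. Each word costs one pass over the column, which should be acceptable for about 100 words but is worth timing on a large dataframe.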

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match, sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
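When the column has missing values or mixed case, the flags and na parameters from the docstring above become relevant. The following sketch assumes a small made-up Series and a hypothetical targets list; it builds one alternation pattern from the list and counts the rows that contain any of the words.

```python
import re

import pandas as pd

# Made-up data with a missing value and mixed case.
words = pd.Series(["hello", "World", None, "hehe"])

# Hypothetical literal search strings, joined into one alternation pattern.
targets = ["he", "wo"]
pattern = "|".join(re.escape(t) for t in targets)

# na=False treats missing values as non-matches, so the mask is purely boolean;
# re.IGNORECASE lets "wo" match "World".
mask = words.str.contains(pattern, flags=re.IGNORECASE, na=False)

n_rows = int(mask.sum())
```

Without na=False, the missing entry would propagate as a missing value in the mask, which makes the mask unsuitable for boolean indexing such as words[mask].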

answered Sep 18 '22 13:09

Andy Hayden


Tags: python, pandas, dataframe
