Regex named groups in R

I am primarily a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful there. For example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:

import pandas as pd
import re

myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'],columns=['origText'])

print("Original dataframe\n------------------")
print(myDF)

# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)

# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)

# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)

myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)

This produces the following output:

Original dataframe
------------------
                           origText
0                 Here is some text
1                  We all love TEXT
2  Where is the TXT or txt textfile
3                   Words and words
4                  Just a few works
5                      See the text
6               both words and text

Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match                
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN

Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text             
1              TEXT             
2  TXT///txt///text             
3                    Word///word
5              text             
6              text         word

Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text             
1                  We all love TEXT              TEXT             
2  Where is the TXT or txt textfile  TXT///txt///text             
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text             
6               both words and text              text         word

I've printed each stage to try to make it easy to follow.
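To make explicit what the named groups contribute, independent of Pandas, here is a minimal sketch of the same collect-and-join step using only the standard `re` module. The helper name `collect_named_matches` and the variable names are illustrative assumptions, not part of the code above. Note one difference: a row with no matches yields empty strings here rather than NaN.

```python
import re

texts = [
    "Here is some text",
    "We all love TEXT",
    "Where is the TXT or txt textfile",
    "Words and words",
    "Just a few works",
    "See the text",
    "both words and text",
]

# Same pattern as above: two alternatives, each in its own named group.
pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)


def collect_named_matches(text, pattern, sep="///"):
    """Hypothetical helper: join all matches of each named group with `sep`."""
    # pattern.groupindex maps each group name to its group number.
    found = {name: [] for name in pattern.groupindex}
    for m in pattern.finditer(text):
        # groupdict() returns None for groups that did not take part in the match.
        for name, value in m.groupdict().items():
            if value is not None:
                found[name].append(value)
    return {name: sep.join(values) for name, values in found.items()}


rows = [collect_named_matches(t, pattern) for t in texts]
for text, row in zip(texts, rows):
    print(text, row)
```

Each dictionary is keyed by group name, which is the role the column names play in the Pandas output.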

The question is: can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).

I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:

origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text"                "We all love TEXT"                 "Where is the TXT or txt textfile" "Words and words"                 
[5] "See the text"                     "both words and text"             

myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7

The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?

303

asked Aug 21 '17 16:08

user1718097

1 Answer

Base R does capture the information about the group names, but it doesn't have a good helper to extract the matches by name. I wrote a wrapper function to help, called regcapturedmatches (it is not part of base R, so its definition has to be loaded first). You can use it like this:

myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl = TRUE, ignore.case = TRUE)
regcapturedmatches(origText, m)

This returns the first match in each string (regexpr stops at the first match, so row 3 shows only "TXT"):

     textOcc wordOcc
[1,] "text"  ""     
[2,] "TEXT"  ""     
[3,] "TXT"   ""     
[4,] ""      "Word" 
[5,] ""      ""     
[6,] "text"  ""     
[7,] ""      "word"
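
For comparison with the Python code in the question, here is a rough sketch of what the matrix above corresponds to on the Python side: the first match per string, with an empty string for any named group that did not take part. It uses only the standard `re` module; the helper name `first_named_match` is an illustrative assumption and is not related to regcapturedmatches.

```python
import re

orig_text = [
    "Here is some text",
    "We all love TEXT",
    "Where is the TXT or txt textfile",
    "Words and words",
    "Just a few works",
    "See the text",
    "both words and text",
]

pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.IGNORECASE)


def first_named_match(text):
    """Hypothetical helper: named groups of the first match, "" where absent."""
    m = pattern.search(text)  # like regexpr, stops at the first match
    if m is None:
        # No match at all: one empty entry per named group.
        return {name: "" for name in pattern.groupindex}
    # default="" replaces None for groups that did not participate.
    return m.groupdict(default="")


first_matches = [first_named_match(t) for t in orig_text]
for row in first_matches:
    print(row)
```

In the last string "word" occurs before "text", so only wordOcc is filled, just as in row 7 of the R output.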

137

answered Oct 09 '22 13:10

MrFlick

Tags: python, regex, pandas, r, regex-group