I am primarily a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful. For example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
'We all love TEXT',
'Where is the TXT or txt textfile',
'Words and words',
'Just a few works',
'See the text',
'both words and text'],columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
origText
0 Here is some text
1 We all love TEXT
2 Where is the TXT or txt textfile
3 Words and words
4 Just a few works
5 See the text
6 both words and text
Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN
Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text
1              TEXT
2  TXT///txt///text
3                    Word///word
5              text
6              text         word
Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text
1                  We all love TEXT              TEXT
2  Where is the TXT or txt textfile  TXT///txt///text
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text
6               both words and text              text         word
I've printed each stage to try to make it easy to follow.
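For anyone porting this to another language, it may help to see the same logic without Pandas. The following is only a minimal sketch using the standard library: the helper name `collect_named_matches` is made up for illustration, and, unlike the Pandas version, rows with no match get empty strings rather than NaN.

```python
import re
from collections import defaultdict

texts = [
    'Here is some text',
    'We all love TEXT',
    'Where is the TXT or txt textfile',
    'Words and words',
    'Just a few works',
    'See the text',
    'both words and text',
]

# Same pattern as above, with two named groups
pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)

def collect_named_matches(text, compiled, sep="///"):
    """Hypothetical helper: map each group name to its joined matches in `text`."""
    found = defaultdict(list)
    for m in compiled.finditer(text):
        # groupdict() gives {group name: matched text or None} for this match
        for name, value in m.groupdict().items():
            if value is not None:
                found[name].append(value)
    # groupindex lists every group name in the pattern, so each one gets a key
    return {name: sep.join(found[name]) for name in compiled.groupindex}

rows = [collect_named_matches(t, pattern) for t in texts]
for text, row in zip(texts, rows):
    print(text, "->", row)
```

Each dictionary in `rows` corresponds to one row of the final dataframe, keyed by the group names from the regex.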
The question is: can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).
I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?
Named groups that share the same name are treated as one and the same group, so there are no pitfalls when using backreferences to that name. If a regex has multiple groups with the same name, backreferences using that name point to the leftmost group in the regex with that name.
If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?'name'group) has one group called “name”. You can reference this group with ${name} in the JGsoft applications, Delphi, ...
(?P<name>...) is a named capturing group, as opposed to an unnamed capturing group. It is similar to regular parentheses, but the substring matched by the group is accessible within the rest of the regular expression via the symbolic group name "name".
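As a small illustration of the points above in Python's `re` module (the sample sentence and variable names are made up), a named group can be read by name from the match, referred to inside the pattern with `(?P=name)`, and inserted into replacement text with `\g<name>`. Note that the `${name}` replacement syntax mentioned above belongs to other regex flavours, not to Python.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it inside the pattern,
# so this pattern finds a word that is immediately repeated.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b", flags=re.I)

sentence = "It was the the best of times"
m = doubled.search(sentence)

first_word = m.group("word")   # access the captured text by group name
groups = m.groupdict()         # all named groups as a dict

# In replacement text, \g<word> inserts whatever the named group matched.
deduped = doubled.sub(r"\g<word>", sentence)

print(first_word, groups, deduped)
```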
Base R does capture the information about the names, but it doesn't have a good helper to extract them by name. I wrote a wrapper called regcapturedmatches to help. You can use it with
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl=TRUE, ignore.case=TRUE)
regcapturedmatches(origText,m)
Which returns
textOcc wordOcc
[1,] "text" ""
[2,] "TEXT" ""
[3,] "TXT" ""
[4,] "" "Word"
[5,] "" ""
[6,] "text" ""
[7,] "" "word"