Start matching from the end of a string

Tags: r

In this question, which was closed, the OP asked how to extract the rank, first, middle, and last names from these strings:

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

#                                 rank             first    middle     last     
# Marshall Robert Forsyth         "Marshall"       "Robert" ""         "Forsyth"
# Deputy Sheriff John A. Gooch    "Deputy Sheriff" "John"   "A."       "Gooch"  
# Constable Darius Quimby         "Constable"      "Darius" ""         "Quimby" 
# High Sheriff John Caldwell Cook "High Sheriff"   "John"   "Caldwell" "Cook"   

I came up with this, which only works if the middle name includes a period; otherwise, the rank pattern greedily captures as much as it can from the beginning of the string.

pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'

f <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  n <- attr(m, "capture.names")
  setNames(mapply('substr', x, s, s + l - 1L), n)
}

do.call('rbind', Map(f, x, pat))

#                                 rank                first      middle last     
# Marshall Robert Forsyth         "Marshall"          "Robert"   ""     "Forsyth"
# Deputy Sheriff John A. Gooch    "Deputy Sheriff"    "John"     "A."   "Gooch"  
# Constable Darius Quimby         "Constable"         "Darius"   ""     "Quimby" 
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" ""     "Cook"

So this works only if the middle name is either not given or includes a period:

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))

So my question is: is there a way to prioritize matching from the end of the string, so that the pattern matches last, middle, and first, and then leaves everything else for rank?

Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.

Related - [1] [2] - I don't think these will work, since their answers suggest another pattern rather than answering the question. Also, in this example, the number of words in the rank is arbitrary, and the pattern matching the rank would also match the first name.

asked Nov 13 '16 15:11

rawr

1 Answer

We cannot start matching from the end; no regex engine I know of has a modifier for that. But we can check how many words remain before the end of the string and restrain our greediness accordingly :). The regex below does this.

This one will do what you want (with the case-insensitive flag enabled, since the character classes only list a-z):

^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$
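For readers working in R, the pattern above can be tried with a sketch like the following. It assumes the case-insensitive flag was intended (added here as `(?i)`), doubles the backslashes for R string syntax, and uses a hypothetical helper, `extract_groups`, that is not part of the original answer.

```r
# Inputs from the question.
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

# The answer's regex as an R string. The (?i) prefix is an assumption
# about the intended flag; without it [a-z] would not match capitals.
pat_end <- paste0(
  "(?i)^(?<rank>(?:(?:[ \\t]|^)[a-z]+)+?)",  # take rank words lazily
  "(?!(?:[ \\t][a-z.]+){4,}$)",              # fail if 4+ words still remain
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",     # optional middle name
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input string, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  # substring() recycles x across the start/length matrices column by column;
  # groups that did not take part in the match come back as "".
  matrix(substring(x, s, s + l - 1L), nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

res <- extract_groups(x, pat_end)
print(res)
```

On these four inputs the result should match the table requested in the question, with "High Sheriff" as the rank and "Caldwell" as the middle name in the last row.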

Live preview in regex101.com

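There is also one exception:

when the string has only a first and last name (no middle name) and the rank has more than one word, part of the rank is captured as the first name. For example, "High Sheriff John Cook" yields rank "High", first "Sheriff", middle "John", and last "Cook".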

To solve this, you have to define a list of rank prefixes that are always followed by another word of the rank (e.g. Deputy, High) and capture that following word greedily.
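One possible way to sketch the prefix idea in R is shown below. The prefix list, the modified rank group, and the helper `extract_groups` are all illustrative assumptions rather than the answerer's own code: known prefixes are consumed greedily at the start of the string, and the remaining rank words are still matched lazily.

```r
# Hypothetical prefix list: words assumed never to form a complete rank alone.
prefixes <- c("deputy", "high")

pat_prefix <- paste0(
  "(?i)^(?<rank>(?:(?:", paste(prefixes, collapse = "|"), ")[ \\t])*",  # greedy prefixes
  "[a-z]+(?:[ \\t][a-z]+)*?)",             # one rank word, then more lazily
  "(?!(?:[ \\t][a-z.]+){4,}$)",            # fail if 4+ words still remain
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",   # optional middle name
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input string, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  matrix(substring(x, s, s + l - 1L), nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

# The first string is the problematic case: two-word rank, no middle name.
x2 <- c("High Sheriff John Cook", "Deputy Sheriff John A. Gooch",
        "Marshall Robert Forsyth")
res2 <- extract_groups(x2, pat_prefix)
print(res2)
```

With this sketch, "High Sheriff John Cook" should give rank "High Sheriff", first "John", and an empty middle name. Prefixes are only recognized at the start of the string, so a multi-word rank that begins with a word outside the list can still be split incorrectly.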

answered Oct 09 '22 00:10

NikitOn
