Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Start matching from the end of a string

Tags:

regex

r

From this question which was closed, the op asked how to extract rank, first, middle, and last from the strings

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

#                                  rank             first    middle      last     
# Marshall Robert Forsyth          "Marshall"       "Robert" ""          "Forsyth"
# Deputy Sheriff John A. Gooch     "Deputy Sheriff" "John"   "A."        "Gooch"  
# Constable Darius Quimby          "Constable"      "Darius" ""          "Quimby" 
# High Sheriff John Caldwell. Cook "High Sheriff"   "John"   "Caldwell"  "Cook"

I came up with this which only works if the middle name includes a period; otherwise, the pattern for rank captures as much as it can from the beginning of the line.

pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'

f <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  n <- attr(m, "capture.names")
  setNames(mapply('substr', x, s, s + l - 1L), n)
}

do.call('rbind', Map(f, x, pat))

#                                 rank                first      middle last     
# Marshall Robert Forsyth         "Marshall"          "Robert"   ""     "Forsyth"
# Deputy Sheriff John A. Gooch    "Deputy Sheriff"    "John"     "A."   "Gooch"  
# Constable Darius Quimby         "Constable"         "Darius"   ""     "Quimby" 
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" ""     "Cook"

So this would work if the middle name was either not given or included a period

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))

So my question is is there a way to prioritize matching from the end of the string such that this pattern matches last, middle, first, then leaving everything else for rank.

Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.


Related - [1] [2] - I don't think these will work since another pattern was suggested rather than answering the question. Also, in this example, the number of words in the rank is arbitrary, and the pattern matching the rank would also work for the first name.

like image 482
rawr Avatar asked Nov 13 '16 15:11

rawr


People also ask

What matches the start and end of the string?

They are called “anchors”. The caret ^ matches at the beginning of the text, and the dollar $ – at the end. The pattern ^Mary means: “string start and then Mary”.

How do you match the start of a string?

The meta character “^” matches the beginning of a particular string i.e. it matches the first character of the string. For example, The expression “^\d” matches the string/line starting with a digit. The expression “^[a-z]” matches the string/line starting with a lower case alphabet.

Which character is used to match end of a string?

The ^ and $ match the beginning and ending of the input string, respectively. The \s (lowercase s ) matches a whitespace (blank, tab \t , and newline \r or \n ). On the other hand, the \S+ (uppercase S ) matches anything that is NOT matched by \s , i.e., non-whitespace.

How do you match the end of a string in Python?

Python String endswith() Method Python string method endswith() returns True if the string ends with the specified suffix, otherwise return False optionally restricting the matching with the given indices start and end.


1 Answers

We cannot start matching from the end, there are no any modifiers for that in any regex systems I know. But we can check how many words do we have until the end, and restrain our greediness :). The below regex is doing this.

This one will do what you want:

^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$

Live preview in regex101.com

enter image description here

There's also one exception:

when you have First, Last and more than 1 word for the rank, the part of rank will become a First name.

enter image description here

To solve this you have to define a list of rank prefixes which mean that there's another word definitely goes after it and capture it in a greedy way.

E.g.: Deputy,High.

like image 170
NikitOn Avatar answered Oct 09 '22 00:10

NikitOn