From this question which was closed, the op asked how to extract rank, first, middle, and last from the strings
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell Cook")
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell. Cook "High Sheriff" "John" "Caldwell" "Cook"
I came up with this which only works if the middle name includes a period; otherwise, the pattern for rank captures as much as it can from the beginning of the line.
pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
f <- function(x, pattern) {
m <- gregexpr(pattern, x, perl = TRUE)[[1]]
s <- attr(m, "capture.start")
l <- attr(m, "capture.length")
n <- attr(m, "capture.names")
setNames(mapply('substr', x, s, s + l - 1L), n)
}
do.call('rbind', Map(f, x, pat))
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" "" "Cook"
So this would work if the middle name was either not given or included a period
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))
So my question is is there a way to prioritize matching from the end of the string such that this pattern matches last, middle, first, then leaving everything else for rank.
Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.
Related - [1] [2] - I don't think these will work since another pattern was suggested rather than answering the question. Also, in this example, the number of words in the rank is arbitrary, and the pattern matching the rank would also work for the first name.
They are called “anchors”. The caret ^ matches at the beginning of the text, and the dollar $ – at the end. The pattern ^Mary means: “string start and then Mary”.
The meta character “^” matches the beginning of a particular string i.e. it matches the first character of the string. For example, The expression “^\d” matches the string/line starting with a digit. The expression “^[a-z]” matches the string/line starting with a lower case alphabet.
The ^ and $ match the beginning and ending of the input string, respectively. The \s (lowercase s ) matches a whitespace (blank, tab \t , and newline \r or \n ). On the other hand, the \S+ (uppercase S ) matches anything that is NOT matched by \s , i.e., non-whitespace.
Python String endswith() Method Python string method endswith() returns True if the string ends with the specified suffix, otherwise return False optionally restricting the matching with the given indices start and end.
We cannot start matching from the end, there are no any modifiers for that in any regex systems I know. But we can check how many words do we have until the end, and restrain our greediness :). The below regex is doing this.
^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$
Live preview in regex101.com
when you have First, Last and more than 1 word for the rank, the part of rank will become a First name.
To solve this you have to define a list of rank prefixes which mean that there's another word definitely goes after it and capture it in a greedy way.
E.g.: Deputy,High.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With