I am new to regex in R. I have a vector of strings, and I want to extract the first occurrence of a number in each string.
The vector is called "shootsummary" and looks like this:
> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building's parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.
The first occurrence of a number in each string is the age of the individual. I want to extract these ages without mixing them up with the other numbers in each string.
I used:
as.numeric(gsub("\\D", "", shootsummary))
This removed every non-digit character, so all the digits in each string were run together:
[1] 34128 42 23 27 6419
I am looking for a result like this, with only the ages extracted and none of the numbers that occur after them:
[1] 34 42 23 27 64
stringi would be faster:
library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"
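If installing a package is not an option, the same "first match only" idea can be sketched in base R with regexpr() and regmatches(). The vector x below is a shortened, hypothetical stand-in for shootsummary, and the as.numeric() call is added because the question asks for numbers rather than character strings.

```r
# Hypothetical sample input standing in for shootsummary
x <- c("Aaron Alexis, 34, a military veteran, killing 12 people and wounding 8.",
       "Pedro Vargas, 42, set fire to his apartment.")

# regexpr() locates only the first match in each string;
# regmatches() then pulls out the matched text.
first_numbers <- regmatches(x, regexpr("\\d+", x))

# Convert the character results to numeric ages
ages <- as.numeric(first_numbers)
ages
```

One thing to watch: regmatches() used with regexpr() drops elements that have no match, so the result can be shorter than the input if some strings contain no digits.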
You could try the sub command below:
> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"
Pattern Explanation:
^ asserts that we are at the start of the string.
\D* matches zero or more non-digit characters.
(\d+) captures the following one or more digits into group 1 (the first number).
.* matches any character zero or more times.
$ asserts that we are at the end of the string.
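Because sub() returns a string unchanged when the pattern does not match, a string with no digits at all would come back whole instead of as a missing value. A minimal sketch of one way to guard against that and get numeric ages, assuming a hypothetical input whose third element contains no number:

```r
# Hypothetical input; the third element contains no digits
test <- c("Aaron Alexis, 34, killing 12 people and wounding 8.",
          "Pedro Vargas, 42, set fire to his apartment.",
          "No age is given here.")

pattern <- "^\\D*(\\d+).*$"

# sub() leaves non-matching strings untouched, so mark them as NA instead
ages_chr <- ifelse(grepl("\\d", test), sub(pattern, "\\1", test), NA_character_)

# Convert to numeric; NA entries stay NA
ages <- as.numeric(ages_chr)
ages
```

The grepl() check is only one possible guard; it keeps the result the same length as the input, which makes it easy to attach the ages back to the original vector.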