I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.
For example, I have one column and a string of data like this:
Text
Section 12345.01 to section 12345.02
And I want to create two new columns from the data in the Text column, like this:
Min Max
12345.01 12345.02
I'm using dplyr and stringr with regex, but the regex only extracts the first occurrence of the pattern (the first number).
df%>%dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+.\\d+"))
If I try to use the stringr::str_extract_all function, it seems to extract both occurrences of the pattern, but it creates a list column in the tibble, which I find a real hassle. So I'm stuck on the first step, just trying to get the numbers out into their own columns.
Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them to numbers with as.numeric, and then run the min() and max() functions.
This can be done with extract from tidyr. extract turns each regex capture group into its own column. convert = TRUE is convenient because it coerces the resulting columns to the most suitable type (numeric here). remove = FALSE can be used if we want to keep the original column. The final mutate is optional; it makes sure that Min really holds the smaller of the two numbers and Max the larger, even when the larger number appears first in the text:
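library(dplyr)  # mutate() and %>%
library(tidyr)  # extract()
library(purrr)  # pmap_dbl()

df %>%
  # Each capture group becomes its own column; convert = TRUE makes them numeric
  extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  # Row-wise min and max across the two extracted columns
  mutate(Min = pmap_dbl(., min),
         Max = pmap_dbl(., max))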
Output:
       Min      Max
1 12345.02 12345.03
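As a possible variation on the optional last step: pmap_dbl(., min) looks at every column of the data frame, so it would also pick up Text if remove = FALSE were used. A minimal sketch that names the two columns explicitly with base R's vectorised pmin() and pmax() is shown here. The intermediate column names First and Second and the second example row are my own illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data; the second row is a made-up extra case for illustration
df <- data.frame(
  Text = c("Section 12345.03 to section 12345.02",
           "Section 1.5 to section 2.5"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # First and Second are hypothetical intermediate column names
  extract(Text, c("First", "Second"),
          "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  # pmin()/pmax() compare the two columns element by element
  mutate(Min = pmin(First, Second),
         Max = pmax(First, Second)) %>%
  select(Min, Max)

print(result)
```

Because only First and Second are passed to pmin() and pmax(), any other columns in the data frame are left out of the comparison.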
Data:
df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA,
-1L), .Names = "Text")
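For comparison, the plan described in the question (str_extract_all, then as.numeric, then min() and max()) can also be made to work by treating the list column as a temporary step. This is only a sketch under some assumptions: every number contains a decimal point, the dot in the pattern is escaped so it matches a literal ".", and nums is a helper column name I made up. A row without any numbers would give Inf and -Inf with a warning.

```r
library(dplyr)
library(stringr)
library(purrr)

# Same example text as above, stored as a plain character column
df <- data.frame(Text = "Section 12345.03 to section 12345.02",
                 stringsAsFactors = FALSE)

result <- df %>%
  mutate(
    # str_extract_all() returns a list with one character vector per row;
    # map() converts each of those vectors to numeric
    nums = map(str_extract_all(Text, "\\d+\\.\\d+"), as.numeric),
    # Collapse each numeric vector to a single value per row
    Min = map_dbl(nums, min),
    Max = map_dbl(nums, max)
  ) %>%
  # Drop the temporary list column
  select(-nums)

print(result)
```

Unlike the two-capture-group regex, this handles any number of matches per row, at the cost of one intermediate list column.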