
Extracting numbers from text with stringr and regex in R

Tags: regex, r, stringr

I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.

For example, I have one column and a string of data like this:

Text
Section 12345.01 to section 12345.02

And I want to create two new columns from the data in the Text column, like this:

Min        Max   
12345.01   12345.02

I'm using dplyr and stringr with regex, but str_extract only returns the first occurrence of the pattern (the first number).

df %>% dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+\\.\\d+"))

If I use stringr::str_extract_all instead, it does extract both occurrences of the pattern, but it creates a list column in the tibble, which I find a real hassle to work with. So I'm stuck on the first step, just trying to get the numbers out into their own columns.

Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them with as.numeric() and then run min() and max() on them.

Seth Brundle asked Sep 24 '18 19:09


1 Answer

One option is extract from tidyr, which turns each regex capture group into its own column. convert = TRUE is convenient because it coerces the resulting columns to the most suitable type (numeric here). remove = FALSE can be used if we want to keep the original column. The final mutate is optional: it makes sure that Min really holds the smaller of the two numbers, whatever order they appear in the text:

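library(dplyr)  # mutate()
library(tidyr)  # extract()
library(purrr)  # pmap_dbl()

df %>%
  # one capture group per number; convert = TRUE makes both columns numeric
  extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  # row-wise min/max, in case the larger number comes first in the text
  mutate(Min = pmap_dbl(., min),
         Max = pmap_dbl(., max))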

Output:

       Min      Max
1 12345.02 12345.03
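
If the original column is kept with remove = FALSE, pmap_dbl(., min) would be handed the Text column as well as the two numbers. A minimal sketch of one way around that, assuming intermediate column names First and Second (hypothetical, not from the answer), is to compare the two extracted columns directly with base R's vectorized pmin() and pmax():

```r
library(dplyr)
library(tidyr)

# Same one-row example as in the answer, stored as plain character
df <- data.frame(Text = "Section 12345.03 to section 12345.02",
                 stringsAsFactors = FALSE)

out <- df %>%
  # First/Second are placeholder names for the two captured numbers
  tidyr::extract(Text, c("First", "Second"), "([\\d.]+)[^\\d.]+([\\d.]+)",
                 convert = TRUE, remove = FALSE) %>%
  # pmin()/pmax() compare the two columns element-wise, row by row
  mutate(Min = pmin(First, Second),
         Max = pmax(First, Second)) %>%
  select(Text, Min, Max)

out
```

Because pmin() and pmax() only look at the columns named in the call, any other columns in the data frame are left alone.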

Data:

df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA, 
-1L), .Names = "Text")
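
For completeness, the route described in the question (str_extract_all, then as.numeric, then min() and max()) can also be made to work by reducing the list column with purrr. This is only a sketch under assumptions: the second data row and the pattern "\\d+\\.?\\d*" are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(stringr)
library(purrr)

# Example data; the second row is hypothetical, added to show several rows
df <- data.frame(
  Text = c("Section 12345.03 to section 12345.02",
           "Section 10.5 to section 20.25"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # str_extract_all() returns a list: one character vector per row,
    # which map() converts to one numeric vector per row
    nums = map(str_extract_all(Text, "\\d+\\.?\\d*"), as.numeric),
    # collapse each row's vector to a single number
    Min  = map_dbl(nums, min),
    Max  = map_dbl(nums, max)
  ) %>%
  select(-nums)  # drop the temporary list column

result
```

Unlike the two-capture-group regex, this handles any number of matches per row. Note that a row with no numbers would yield Inf and -Inf (with a warning) from min() and max(), so such rows may need to be filtered or handled first.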
acylam answered Sep 28 '22 03:09