Regular expression matching inside dplyr

When answering this question, I wrote the following code:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

require(stringr)

matches = str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter=matches[,2], number=matches[,3])

Now my question is: Is there a simple way to combine the last two lines into one dplyr call, presumably using mutate()? Alternatively, I'd be interested in a solution with do() as well. For the mutate() approach, since we're extracting two groups, I'll accept a solution that calls str_match() twice with different regular expressions, one for each desired group.

Edit: To clarify, the main challenge I see here is that str_match returns a matrix, and I'm wondering how to handle that in mutate() or do(). I'm not interested in solutions to the original problem using other methods of extracting the information. There are plenty of such solutions given already here.

Claus Wilke asked Jul 07 '15 13:07


1 Answer

You could do this with extract() from the tidyr package:

library(tidyr)
extract(df, Call_Num, into = c("letter", "number"), regex = "([A-Z]+)(\\d+)\\s*\\.", remove = FALSE)

                                             Call_Num letter number
1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
4          D753 .F4 Circulating Collection, 3rd Floor      D    753
5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

It's not dplyr, but as stated on the CRAN page linked above, tidyr "is designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines."
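
If you do want to stay inside mutate(), here is a minimal sketch of the "call str_match() twice" route the question allows. It is not part of the tidyr approach above. It assumes that indexing the matrix returned by str_match() with [, 2] and [, 3] is acceptable, which turns each capture group into a plain character vector that mutate() can store as a column. The name pattern is only introduced here for readability.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Same regular expression as in the question; 'pattern' is an illustrative name.
pattern <- "([A-Z]+)(\\d+)\\s*\\."

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups. Selecting a single column
# yields a vector of the right length for mutate().
df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],
         number = str_match(Call_Num, pattern)[, 3])
```

This runs the regular expression twice per row, so extract() is likely the tidier choice when the goal is simply to split one column into several.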

Sam Firke answered Sep 27 '22 22:09