The problem is to efficiently parse data of this format:
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
into a dataframe of two columns; one for the position, and one for the player.
The names are baseball players, and each name is preceded by the player's position. The positions are always exactly C, P, P, OF, 3B, SS, 1B, OF, 2B, OF, in some order. That is, those exact positions always occur.
For example, "C James McCann" should turn into
data.frame(position = "C", player = "James McCann")
In reality, I have many hundreds of thousands of such strings, and I want to parse them efficiently. Here is my inefficient solution:
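library(stringr)  # str_match_all(), str_split(), str_trim(); also re-exports %>%
data.frame(
  # the position codes: one or two digits/capitals surrounded by whitespace
  position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
  # the text between position codes; the first piece is empty and is dropped
  player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
  stringsAsFactors = FALSE
)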
This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?
You could make a single pattern that would get you both the position and the player name with stringi::stri_match_all_regex:
library(stringi)
stri_match_all_regex(lineup,
  pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
[[1]]
[,1] [,2] [,3]
[1,] "C James McCann" "C" "James McCann"
[2,] "P Robbie Ray" "P" "Robbie Ray"
[3,] "P Rafael Montero" "P" "Rafael Montero"
[4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
[5,] "3B Derek Dietrich" "3B" "Derek Dietrich"
[6,] "SS Miguel Rojas" "SS" "Miguel Rojas"
[7,] "1B Tommy Joseph" "1B" "Tommy Joseph"
[8,] "OF Marcell Ozuna" "OF" "Marcell Ozuna"
[9,] "OF Christian Yelich" "OF" "Christian Yelich"
I made the pattern more restrictive than yours: it only accepts the one- or two-character codes that are actual baseball positions. The result is a list with one matrix per input string. You should probably post a more complex example to support the further processing that will be needed. Assuming the result is stored in results, you will need something along the lines of:
lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
V1 V2
1 C James McCann
2 P Robbie Ray
3 P Rafael Montero
4 OF Giancarlo Stanton
5 3B Derek Dietrich
6 SS Miguel Rojas
7 1B Tommy Joseph
8 OF Marcell Ozuna
9 OF Christian Yelich
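Since the question mentions hundreds of thousands of strings, here is a minimal sketch of how the list of matrices could be flattened into one data frame in a single step instead of one data frame per string. The names lineups, n, all_rows, and lineup_id are illustrative and not from the original post, and the two short strings are made-up fragments of the example lineup.

```r
library(stringi)

# Two short illustrative lineups (hypothetical fragments of the example data)
lineups <- c(" C James McCann P Robbie Ray",
             " SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna")

# stri_match_all_regex() is vectorised: one matrix per input string
results <- stri_match_all_regex(
  lineups,
  pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)"
)

# Number of matches per string, used to label each row with its source string.
# With the default settings, a string without any match yields one row of NAs.
n <- vapply(results, nrow, integer(1))

# Stack all matrices once, then build a single data frame
all_rows <- do.call(rbind, results)
parsed <- data.frame(
  lineup_id = rep(seq_along(results), n),
  position  = all_rows[, 2],
  player    = all_rows[, 3],
  stringsAsFactors = FALSE
)
```

Binding once at the end avoids creating many small data frames, which tends to be the slow part when the input is large.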
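Note that the output above has only nine rows: "2B C?sar Hern?ndez" is dropped because "?" is not matched by [A-Za-z]. If there are going to be hyphenated names, middle names, or initials, the pattern will likewise need to be more complex.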
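One possible way to make the pattern less dependent on the shape of the names is to stop describing the name at all and instead capture everything up to the next position code. The sketch below is an assumption about how that could look, not part of the original answer: it uses a lazy match plus a lookahead, and the variable names pos, pattern, m, and parsed are illustrative.

```r
library(stringi)

lineup <- " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

pos <- "C|P|OF|SS|1B|2B|3B"

# A position code, a space, then the shortest run of text that is followed
# by either another " <position> " token or the end of the string
pattern <- paste0("(", pos, ") (.+?)(?= (?:", pos, ") |$)")

m <- stri_match_all_regex(lineup, pattern)[[1]]
parsed <- data.frame(position = m[, 2], player = m[, 3],
                     stringsAsFactors = FALSE)
```

With this pattern the name containing "?" is kept, because the name part is no longer restricted to letters. A bare initial that equals a position code (for example a middle initial P without a dot) would still be mistaken for a position.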
Here is a solution which converts lineup into a string in CSV format, which is then read by fread():
library(magrittr) # piping used to improve readability
lineup %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
The "trick" is to put a line break in front of the position characters and a column separator after, e.g., " C " becomes "\nC;".
lineup %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")
returns
[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"
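For readers without stringr or data.table at hand, the same line-break-and-separator idea can be sketched with base R only. This swaps in gsub() and read.table() for the functions used above and is meant as an illustration under that assumption, not as the answer's own code; csv_text and parsed are illustrative names, and the lineup is shortened.

```r
# Shortened example string (first four players of the original lineup)
lineup <- " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton"

# Put a line break in front of each position code and a ";" after it
csv_text <- gsub("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;", lineup)

# Read the resulting text as ";"-separated data.
# quote = "" and comment.char = "" keep apostrophes and "#" in names literal;
# the leading blank line is skipped by default.
parsed <- read.table(text = csv_text, sep = ";",
                     col.names = c("position", "player"),
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)
```

For very large inputs fread() is likely to be faster than read.table(), so this variant is mainly useful for checking the transformation step in isolation.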
This approach does not make many assumptions about the names. It even works with names like "James P. McCann" or "Robbie Ray, Jr" (lineup2 is defined with the other test strings at the end).
lineup2 %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
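There are three prerequisites which must be fulfilled:
1. Name initials that coincide with a position code, such as C and P, must be followed by a dot to avoid confusion.
2. The separator ; must not be used elsewhere in lineup.
3. The string must start with whitespace before the first position.
Condition 3 can be waived with an improved regular expression, and condition 2 can be checked for: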
lineup3 %T>%
{stopifnot(!stringr::str_detect(., ";"))} %>%
stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"