
R efficiency challenge: Splitting a long character vector

The problem is to efficiently parse data of this format:

lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

into a data frame with two columns: one for the position and one for the player.

The names are baseball players, and each name is preceded by the player's position. The positions are always exactly the multiset {C, P, P, OF, 3B, SS, 1B, OF, 2B, OF}, in some order.

For example, "C James McCann" should turn into

data.frame(position = "C", player = "James McCann")

In reality, I have many hundreds of thousands of such strings, and I want to parse them efficiently. Here is my inefficient solution:

library(stringr)   # str_match_all(), str_split(), str_trim()
library(magrittr)  # %>%
data.frame(
    position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
    player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
    stringsAsFactors = FALSE
)

This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?

ThanksABundle asked Mar 05 '23 07:03


2 Answers

You could make a single pattern that would get you both the position and the player name with stringi::stri_match_all_regex:

library(stringi)
results <- stri_match_all_regex(lineup,
                   pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results
[[1]]
      [,1]                   [,2] [,3]               
 [1,] "C James McCann"       "C"  "James McCann"     
 [2,] "P Robbie Ray"         "P"  "Robbie Ray"       
 [3,] "P Rafael Montero"     "P"  "Rafael Montero"   
 [4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
 [5,] "3B Derek Dietrich"    "3B" "Derek Dietrich"   
 [6,] "SS Miguel Rojas"      "SS" "Miguel Rojas"     
 [7,] "1B Tommy Joseph"      "1B" "Tommy Joseph"     
 [8,] "OF Marcell Ozuna"     "OF" "Marcell Ozuna"    
 [9,] "OF Christian Yelich"  "OF" "Christian Yelich" 

I made the pattern more restrictive than yours: it limits the one or two characters between spaces to the combinations that are actual baseball positions. The result is a list with one matrix per input string. You should probably post a more complex example to support the further processing that will be needed. To turn each matrix into a data frame, drop the first column (the full match) with something along the lines of:

lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
  V1                V2
1  C      James McCann
2  P        Robbie Ray
3  P    Rafael Montero
4 OF Giancarlo Stanton
5 3B    Derek Dietrich
6 SS      Miguel Rojas
7 1B      Tommy Joseph
8 OF     Marcell Ozuna
9 OF  Christian Yelich

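Note that the 2B entry (C?sar Hern?ndez) is missing from the output above because its name contains characters that [A-Za-z] does not match. If there are going to be hyphenated names, middle names, or initials, the pattern will need to be more complex as well.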
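One way to relax the name pattern and to collect the matches from many strings into a single data frame is sketched below. It is an illustration under stated assumptions, not part of the original answer: it uses base R's gregexpr() and regmatches() instead of stringi so that it runs without extra packages, it assumes every name consists of exactly two space-free tokens, and the names lineups, hits and parsed as well as the second input string are made up for the example.

```r
# Hypothetical input: two shortened lineup strings (the second is invented)
lineups <- c(
  " C James McCann P Robbie Ray 2B C?sar Hern?ndez OF Christian Yelich",
  " SS Miguel Rojas 1B Tommy Joseph"
)

# A position token, one space, then exactly two space-free name tokens.
# \\S+ accepts any non-space character, so "C?sar" or accented letters pass.
pattern <- "\\b(?:C|P|OF|SS|1B|2B|3B) \\S+ \\S+"

# One character vector of full matches per input string
hits <- regmatches(lineups, gregexpr(pattern, lineups, perl = TRUE))

# Flatten the list and remember which input string each match came from
flat <- unlist(hits)
parsed <- data.frame(
  lineup_id = rep(seq_along(hits), lengths(hits)),
  position  = sub(" .*$", "", flat),    # text before the first space
  player    = sub("^\\S+ ", "", flat),  # text after the first space
  stringsAsFactors = FALSE
)
parsed
```

The lineup_id column is an added assumption about what the "further processing" would need: it keeps rows from different strings apart once they are stacked.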

IRTFM answered Mar 16 '23 22:03


Here is a solution which converts lineup into a string in csv file format which is then read by fread():

library(magrittr)  # piping used to improve readability
lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

The "trick" is to put a line break in front of the position characters and a column separator after, e.g., " C " becomes "\nC;".

lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")

returns

[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"
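The same trick can be reproduced without stringr and data.table. The following is only a sketch of the idea in base R, swapping gsub() for str_replace_all() and read.table() for fread(); the shortened lineup string and the names csv_text and parsed are assumptions made for the example, and no claim is made here about relative speed.

```r
# Shortened example string (assumption: same format as the original lineup)
lineup <- " C James McCann P Robbie Ray OF Giancarlo Stanton"

# Put a line break in front of each position and a ";" after it
csv_text <- gsub("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;", lineup, perl = TRUE)

# Read the resulting text as a two-column, semicolon-separated table.
# quote = "" and comment.char = "" keep apostrophes and "#" in names intact;
# the leading empty line is skipped because blank.lines.skip defaults to TRUE.
parsed <- read.table(
  text = csv_text, sep = ";", header = FALSE,
  col.names = c("position", "player"),
  colClasses = "character", quote = "", comment.char = ""
)
parsed
```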

This approach does not make many assumptions about the names. It even works with names like James P. McCann or Robbie Ray, Jr.

lineup2 %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

There are three prerequisites which must be fulfilled:

  1. The name part must not contain any initials which are also used as position indicators, e.g., the initials C and P must be followed by a period ("C.", "P.") to avoid confusion.
  2. The column separator ; must not be used elsewhere in lineup.
  3. The string must start with a leading space.
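
The first two prerequisites can be checked before parsing. The sketch below is a rough illustration in base R, not part of the original answer: is_parseable, expected and token_pattern are hypothetical names, and the expected positions are taken from the question's statement that exactly these ten positions always occur. It counts tokens with lookarounds, so in corner cases (two position-like tokens directly next to each other) it may count differently from the replacing regex.

```r
# The ten positions every lineup is assumed to contain (from the question)
expected <- sort(c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF"))

# A position-like token: preceded by start or whitespace, followed by whitespace
token_pattern <- "(?<!\\S)(?:C|P|OF|SS|1B|2B|3B)(?=\\s)"

is_parseable <- function(x) {
  # Prerequisite 2: the column separator must not occur in the string
  no_separator <- !grepl(";", x, fixed = TRUE)
  # Prerequisite 1: a bare initial such as "P" shows up as an extra token
  tokens <- regmatches(x, gregexpr(token_pattern, x, perl = TRUE))
  right_positions <- vapply(
    tokens, function(tk) identical(sort(tk), expected), logical(1)
  )
  no_separator & right_positions
}

ok <- " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
bad_initial <- sub("P.", "P", ok, fixed = TRUE)  # "James P McCann"
bad_sep     <- sub(",", ";", ok, fixed = TRUE)   # "Robbie Ray; Jr"

is_parseable(c(ok, bad_initial, bad_sep))
# TRUE FALSE FALSE
```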

Condition 3 can be waived by using an improved regular expression, and condition 2 can be checked for:

lineup3 %T>% 
  {stopifnot(!stringr::str_detect(., ";"))} %>% 
  stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

Data

# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

Uwe answered Mar 16 '23 22:03