R efficiency challenge: Splitting a long character vector

Question

The problem is to efficiently parse data of this format:

lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"

into a data frame with two columns: one for the position and one for the player.

The names are baseball players, and each name is preceded by the player's position. The positions are always exactly {C, P, P, OF, 3B, SS, 1B, OF, 2B, OF}, in some order.

For example, "C James McCann" should turn into

data.frame(position = "C", player = "James McCann")

In reality, I have many hundreds of thousands of such strings, and I want to parse them efficiently. Here is my inefficient solution:

library(stringr)  # str_match_all(), str_split(), str_trim(); also re-exports %>%

data.frame(
    position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
    player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
    stringsAsFactors = FALSE
)

This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?

IRTFM · Accepted Answer

You could make a single pattern that would get you both the position and the player name with stringi::stri_match_all_regex:

library(stringi)
results <- stri_match_all_regex(lineup,
                   pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results
[[1]]
      [,1]                   [,2] [,3]               
 [1,] "C James McCann"       "C"  "James McCann"     
 [2,] "P Robbie Ray"         "P"  "Robbie Ray"       
 [3,] "P Rafael Montero"     "P"  "Rafael Montero"   
 [4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
 [5,] "3B Derek Dietrich"    "3B" "Derek Dietrich"   
 [6,] "SS Miguel Rojas"      "SS" "Miguel Rojas"     
 [7,] "1B Tommy Joseph"      "1B" "Tommy Joseph"     
 [8,] "OF Marcell Ozuna"     "OF" "Marcell Ozuna"    
 [9,] "OF Christian Yelich"  "OF" "Christian Yelich"

I made the pattern more restrictive than yours: it only accepts the one- or two-character tokens that are actual baseball positions. The result is a list with one matrix per input string. You should probably post a more complex example to support the further processing that will be needed. With the output stored in results, you will need something along the lines of:

lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
  V1                V2
1  C      James McCann
2  P        Robbie Ray
3  P    Rafael Montero
4 OF Giancarlo Stanton
5 3B    Derek Dietrich
6 SS      Miguel Rojas
7 1B      Tommy Joseph
8 OF     Marcell Ozuna
9 OF  Christian Yelich

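If there are going to be hyphenated names, middle names, or initials, then the pattern will need to be more complex. The same goes for accented letters: the 2B row is missing from the output above because [A-Za-z] only matches ASCII letters.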
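For the full data set, one way this might look is sketched below. It assumes the lineups sit in a character vector (called lineups here; the name and its contents are made up for illustration) and swaps [A-Z] and [A-Za-z] for the Unicode classes \p{Lu} and \p{L}, so accented names are matched too. It still only handles two-word names. The per-string matrices are stacked once with rbind instead of being converted to data frames one by one, and lineup_id is an illustrative column that records which string each row came from.

```r
library(stringi)

# Hypothetical input: one lineup string per element
lineups <- c(
  " C James McCann P Robbie Ray 2B C\u00e9sar Hern\u00e1ndez",
  " OF Christian Yelich SS Miguel Rojas"
)

# \p{Lu} = any uppercase letter, \p{L} = any letter (Unicode-aware)
pattern <- "(C|P|OF|SS|1B|2B|3B) (\\p{Lu}\\p{L}+ \\p{Lu}\\p{L}+)"

# List with one character matrix per input string
matches <- stri_match_all_regex(lineups, pattern)

# Number of matched players per string.
# Caveat: a string without any match yields a single row of NA values.
n_per_lineup <- vapply(matches, nrow, integer(1))

# Stack all matrices in one step
stacked <- do.call(rbind, matches)

parsed <- data.frame(
  lineup_id = rep(seq_along(lineups), times = n_per_lineup),
  position  = stacked[, 2],
  player    = stacked[, 3],
  stringsAsFactors = FALSE
)
```

Whether this is fast enough for hundreds of thousands of strings would have to be measured on the real data.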

Uwe · Answer

Here is a solution which converts lineup into a string in csv file format which is then read by fread():

library(magrittr)  # piping used to improve readability
lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))

    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   César Hernández
10:       OF  Christian Yelich

The "trick" is to put a line break in front of each position token and a column separator after it, e.g., " C " becomes "\nC;".

lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;")

returns

[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;César Hernández\nOF;Christian Yelich"

This approach does not make many assumptions about the names. It even works with names like James P. McCann or Robbie Ray, Jr.

lineup2 %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))

    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   César Hernández
10:       OF  Christian Yelich

There are three prerequisites which must be fulfilled:

1. The name part must not contain standalone initials that are also position codes; e.g., the initials C and P must be followed by a dot to avoid confusion.
2. The column separator ; must not be used elsewhere in lineup.
3. The string must start with a leading space.
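To illustrate the first prerequisite, here is a small sketch with a made-up two-player lineup (bad is a hypothetical example, not part of the data below) in which a middle initial P has no dot. The replacement then treats the initial as a position and produces three records instead of two.

```r
library(stringr)

# Hypothetical two-player lineup; the middle initial "P" lacks its dot
bad <- " C James McCann P Rafael P Montero"
pattern <- "\\s(C|P|OF|SS|1B|2B|3B)\\s"

# The undotted initial is replaced like a real position token
out <- str_replace_all(bad, pattern, "\n\\1;")

# Number of position tokens found: 3, although only two players are present
n_found <- str_count(bad, pattern)
```

Since the question states that every lineup contains exactly ten positions, comparing str_count() against 10 might be a cheap way to flag affected strings before parsing.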

Condition 3 can be waived with an improved regular expression, and condition 2 can be checked for:

lineup3 %T>% 
  {stopifnot(!stringr::str_detect(., ";"))} %>% 
  stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\n\\2;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))

    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   César Hernández
10:       OF  Christian Yelich
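The question mentions hundreds of thousands of such strings. A minimal sketch of how the same idea might be applied to a whole character vector at once follows; lineups, its contents, and the lineup_id column are assumptions made for illustration. Because every converted string starts with a line break, the pieces can simply be pasted together and read with a single fread() call.

```r
library(stringr)
library(data.table)

# Hypothetical input: one lineup string per element, with or without leading space
lineups <- c(
  " C James McCann P Robbie Ray",
  "OF Christian Yelich SS Miguel Rojas"
)
pattern <- "(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s"

# The column separator must not occur in the data
stopifnot(!any(str_detect(lineups, ";")))

# str_replace_all() is vectorised: one CSV fragment per lineup
csv_parts <- str_replace_all(lineups, pattern, "\n\\2;")

# Parse all fragments in one go
parsed <- fread(
  text = paste(csv_parts, collapse = ""),
  sep = ";", header = FALSE,
  col.names = c("position", "player")
)

# Record which input string each row came from
parsed[, lineup_id := rep(seq_along(lineups), times = str_count(lineups, pattern))]
```

The lineup_id step relies on every position token producing exactly one row, so it inherits the prerequisites listed above.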

Data

# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"

# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B César Hernández OF Christian Yelich"
