Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Break Apart a String into Separate Columns R

I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own column. I have tried separating with a space delimiter, but that has not worked well since there are teams such as "Duke" with 1 part to their name, and teams with 2 to 3 parts to their name (Michigan State, South Dakota State, etc). There also teams with "-" dashes in their name.

Here is my data:

df <- data.frame(list(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts", 
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
    )
  ))

Desired output:

date  time     away_team_id  away_team_name     home_team_id home_team_name
12/16 7:00 PM    751         Appalachian State  752          Duke
12/16 7:00 PM    753         Chicago State      754          Indiana-Purdue
12/16 8:00 PM    755         Texas-Arlington    756          Oral Roberts
12/16 10:00 PM   757         Dartmouth          758          Stanford

@Jonny Phelps @doRemy

enter image description here

like image 552
bodega18 Avatar asked Jan 24 '23 04:01

bodega18


2 Answers

A simple way is to use the extract from the dplyr library with a regex expression:

# Define the column names:
column_names <- c("date", "time", "away_team_id", "away_team_name", "home_team_id", "home_team_name")
# Define the regex expression:
regex_expr <- paste(
  "([0-9]{1,2}[/][0-9]{1,2})", # The date
  "([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})", # The time
  "([0-9]+)", # The away team id
  "([A-Za-z -]+)", # The away team name
  "([0-9]+)", # The home team id
  "([A-Za-z -]+)" # The home team name
)
# Extract the columns:
df %>% extract(col = game_info, into = column_names, regex = regex_expr)
like image 61
DoRemy95 Avatar answered Jan 30 '23 09:01

DoRemy95


Here's one with regex. See regex101 link for the regex explanations

regex <- "^(\\d{2}\\/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d+)\\s*([^\\d.]+)(\\d+)\\s*([^\\d.]+)$"

data <- data.frame(game_info=
  "12/16 7:00 PM 751 Appalachian State 752 Duke"
  ,"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue"
  ,"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts"
  ,"12/16 10:00 AM 757 Dartmouth 758 Stanford"
)
library(stringr)

out <- do.call(rbind, str_match_all(data, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
                "home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out

Explanation:

^(\d{2}/\d{2}) - starts with 2 digits/2 digits like 12/16. ^ is a start anchor and () are used to say we want to capture this group for plucking out

\s* - 0 or more spaces between our first group and the next

(\d{1,2}:\d{2}\s*(PM|AM)) - want 1 or 2 digits : 2 digits, then possibly a space and PM or AM

\s*(\d+)\s* - spaces around any number of digits, the first id

([^\d.]+) - all non numeric characters. This will fall down if there are ever numbers in your team names. If so, find some examples and we can improve it. White space is captured afterwards so is removed later with trimws

(\d+)\s* - second id and spaces

([^\d.]+)$ - finally the other team name and the end sentence anchor

like image 24
Jonny Phelps Avatar answered Jan 30 '23 09:01

Jonny Phelps