I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own column. I have tried separating with a space delimiter, but that has not worked well since there are teams such as "Duke" with 1 part to their name, and teams with 2 to 3 parts to their name (Michigan State, South Dakota State, etc). There also teams with "-" dashes in their name.
Here is my data:
df <- data.frame(list(
game_info = c(
"12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford"
)
))
Desired output:
date time away_team_id away_team_name home_team_id home_team_name
12/16 7:00 PM 751 Appalachian State 752 Duke
12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue
12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts
12/16 10:00 PM 757 Dartmouth 758 Stanford
@Jonny Phelps @doRemy
A simple way is to use the extract
from the dplyr
library with a regex expression:
# Define the column names:
column_names <- c("date", "time", "away_team_id", "away_team_name", "home_team_id", "home_team_name")
# Define the regex expression:
regex_expr <- paste(
"([0-9]{1,2}[/][0-9]{1,2})", # The date
"([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})", # The time
"([0-9]+)", # The away team id
"([A-Za-z -]+)", # The away team name
"([0-9]+)", # The home team id
"([A-Za-z -]+)" # The home team name
)
# Extract the columns:
df %>% extract(col = game_info, into = column_names, regex = regex_expr)
Here's one with regex. See regex101 link for the regex explanations
regex <- "^(\\d{2}\\/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d+)\\s*([^\\d.]+)(\\d+)\\s*([^\\d.]+)$"
data <- data.frame(game_info=
"12/16 7:00 PM 751 Appalachian State 752 Duke"
,"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue"
,"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts"
,"12/16 10:00 AM 757 Dartmouth 758 Stanford"
)
library(stringr)
out <- do.call(rbind, str_match_all(data, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
"home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out
Explanation:
^(\d{2}/\d{2}) - starts with 2 digits/2 digits like 12/16. ^ is a start anchor and () are used to say we want to capture this group for plucking out
\s* - 0 or more spaces between our first group and the next
(\d{1,2}:\d{2}\s*(PM|AM)) - want 1 or 2 digits : 2 digits, then possibly a space and PM or AM
\s*(\d+)\s* - spaces around any number of digits, the first id
([^\d.]+) - all non numeric characters. This will fall down if there are ever numbers in your team names. If so, find some examples and we can improve it. White space is captured afterwards so is removed later with trimws
(\d+)\s* - second id and spaces
([^\d.]+)$ - finally the other team name and the end sentence anchor
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With