I have a dataframe of chess games with two columns, as shown below:
dd <- data.frame(
game_id = c(101,102),
moves = c("1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7","1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6")
)
Here each row is a separate game, uniquely identified by the game id. The moves column contains all the moves of a game in sequential order from left to right. The serial number of a move is the number just before each dot ".". Each move has two parts: the first part is always the move by the white player, followed by the second part, which is the move by the black player. The two parts are separated by a single space. As the data above shows, two consecutive moves are also separated by a single space; however, there is no gap between the dot of the serial number and the first character of the white player's move. The total number of moves in a game is arbitrary, as some games end in a few moves while others have many.
Question: All the moves of a game sit in one single cell of the dataframe, which is awkward for analysis. I want to convert this to a dataframe with a better structure, as shown below:
game_id | move_no | white | black
----------------------------------
101 | 1 | e4 | c5
101 | 2 | Nf3 | d6
101 | 3 | d4 | cxd4
101 | 4 | Nxd4 | Nf6
How can this be done in R?
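We can split the move string with a regular expression. Here I've used stringr::str_match_all to capture each part of the moves. (The raw string r"{...}" needs R 4.0 or later, and the _ placeholder of the native pipe needs R 4.2 or later.)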
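dd$moves |>
  # one character matrix per game: full match, move number, white, black
  stringr::str_match_all(r"{(\d+)\.(\S+) (\S+)}") |>
  # keep the three capture groups as columns
  lapply(function(x) data.frame(move_id=as.numeric(x[,2]), white=x[,3], black=x[,4])) |>
  # attach each game's id to its moves (the m= name causes the "m." column prefix)
  Map(cbind.data.frame, game_id=dd$game_id, m=_) |>
  # stack the per-game data frames into one
  do.call("rbind", args=_)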
which returns (first rows shown)
game_id m.move_id m.white m.black
1 101 1 e4 c5
2 101 2 Nf3 d6
3 101 3 d4 cxd4
4 101 4 Nxd4 Nf6
5 101 5 Nc3 Nc6
6 101 6 Bc4 e6
7 101 7 Be3 Be7
8 102 1 e3 c5
9 102 2 Nf3 Nc6
The main part is the regular expression r"{(\d+)\.(\S+) (\S+)}", which matches a number followed by a period, then captures two whitespace-free tokens (the white move and the black move) separated by a single space.
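To make the capture groups concrete, here is a small sketch that runs the same pattern on a single, shortened move string. The names one_game and m are just for illustration, and it assumes stringr is installed and R is 4.0 or later (for the raw string).

```r
library(stringr)

# A shortened move string in the same format as the moves column
one_game <- "1.e4 c5 2.Nf3 d6"

# str_match_all returns a list with one character matrix per input string;
# [[1]] picks the matrix for our single string
m <- str_match_all(one_game, r"{(\d+)\.(\S+) (\S+)}")[[1]]

# Column 1 is the full match, columns 2 to 4 are the capture groups:
# move number, white's move, black's move
m
```

Each row of m is one full move, which is why the answer above indexes x[,2], x[,3] and x[,4] when building the data frame.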
With base R
Data
x <- "
game_id | moves
101 | 1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7
102 | 1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6
"
df <- read.table(textConnection(x) , header = TRUE , sep = "|")
Using a helper function fn
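fn <- function(df) {
  lst <- list()
  id <- 1   # index of the next output row
  L <- 1    # index of the current game
  # split on "." and " " so the tokens come in triples: move_no, white, black
  clmn <- strsplit(trimws(df$moves) , "[. ]")
  for (tokens in clmn) {
    # seq_len() avoids the 1:0 trap when a game has no moves
    for (k in seq_len(length(tokens) %/% 3)) {
      j <- 3*k - 2   # position of the move number within this triple
      lst[[id]] <- c(df$game_id[[L]] , tokens[j:(j + 2)])
      id <- id + 1
    }
    L <- L + 1
  }
  lst
}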
#===================================
df <- data.frame(do.call(rbind , fn(df)))
colnames(df) <- c("game_id" , "move_no" , "white" , "black")
Output
df
#> game_id move_no white black
#> 1 101 1 e4 c5
#> 2 101 2 Nf3 d6
#> 3 101 3 d4 cxd4
#> 4 101 4 Nxd4 Nf6
#> 5 101 5 Nc3 Nc6
#> 6 101 6 Bc4 e6
#> 7 101 7 Be3 Be7
#> 8 102 1 e3 c5
#> 9 102 2 Nf3 Nc6
#> 10 102 3 d4 cxd4
#> 11 102 4 Nxd4 Nf6
#> 12 102 5 Nc3 e5
#> 13 102 6 Ndb5 d6
Created on 2022-06-16 by the reprex package (v2.0.1)
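A side note on the loop above: because each row is assembled with c(), game_id and move_no end up as character strings. As a rough alternative sketch (not the original answer's method), the same tokens can be folded into a three-column matrix with matrix(..., byrow = TRUE), which lets the numeric columns keep their type. The sample vectors moves and game_id and the name out are made up for illustration, and the sketch assumes every move has both a white and a black part and R 4.0 or later.

```r
# Hypothetical shortened sample data in the same format as the question
moves   <- c("1.e4 c5 2.Nf3 d6", "1.e3 c5 2.Nf3 Nc6 3.d4 cxd4")
game_id <- c(101, 102)

# Split on "." and " " so tokens come in triples: move_no, white, black
tokens <- strsplit(trimws(moves), "[. ]")

rows <- lapply(seq_along(tokens), function(g) {
  # Fill row by row so each matrix row is one full move
  m <- matrix(tokens[[g]], ncol = 3, byrow = TRUE)
  data.frame(
    game_id = game_id[g],
    move_no = as.integer(m[, 1]),
    white   = m[, 2],
    black   = m[, 3]
  )
})

# Stack the per-game data frames
out <- do.call(rbind, rows)
str(out)
```

Here game_id stays numeric and move_no is an integer, so sorting or filtering by move number works without a later conversion step.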