
How to transform a sequential string of chess moves into a vertical dataframe?

I have a dataframe of chess games with two columns, as shown below:

dd <- data.frame(
  game_id = c(101,102),
  moves = c("1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7","1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6")  
)

Each row is a separate game, uniquely identified by its game id. The moves column contains all the moves of a game in sequential order from left to right. The serial number of a move is the number just before each dot ".". Each move has two parts: White's move first, then Black's move, separated by a single space. As the data above shows, consecutive moves are also separated by a single space, but there is no gap between the dot of the serial number and the first character of White's move. The number of moves per game is arbitrary: some games end in a few moves, while others have many.

Question: All the moves of a game sit in a single cell of the dataframe, which makes analysis awkward. I want to convert this into a dataframe with a better structure, as shown below:

game_id  | move_no | white | black
----------------------------------
    101  | 1       | e4    | c5
    101  | 2       | Nf3   | d6
    101  | 3       | d4    | cxd4
    101  | 4       | Nxd4  | Nf6 

How can this be done in R?

Stacker asked Nov 29 '25 20:11


2 Answers

We can split the move string with a regular expression. Here I've used stringr::str_match_all to capture each part of the moves.

dd$moves |>
  stringr::str_match_all(r"{(\d+)\.(\S+) (\S+)}") |>
  lapply(function(x) data.frame(move_id=as.numeric(x[,2]), white=x[,3], black=x[,4])) |> 
  Map(cbind.data.frame, game_id=dd$game_id, m=_) |>
  do.call("rbind", args=_)

which will return

   game_id m.move_id m.white m.black
1      101         1      e4      c5
2      101         2     Nf3      d6
3      101         3      d4    cxd4
4      101         4    Nxd4     Nf6
5      101         5     Nc3     Nc6
6      101         6     Bc4      e6
7      101         7     Be3     Be7
8      102         1      e3      c5
9      102         2     Nf3     Nc6
10     102         3      d4    cxd4
11     102         4    Nxd4     Nf6
12     102         5     Nc3      e5
13     102         6    Ndb5      d6

The main part is the regular expression r"{(\d+)\.(\S+) (\S+)}", which finds a number followed by a period, then captures two runs of non-space characters (White's move and Black's move) separated by a single space.
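
To see what the three capture groups pick up, here is a minimal sketch that applies the same pattern to a single move string using only base R. It swaps stringr::str_match_all for gregexec() and regmatches() (gregexec() needs R 4.1 or later), and the names moves, pattern, m and one_game are made up for this illustration. The columns are named as requested in the question.

```r
# One game's moves, shortened from the question's data
moves <- "1.e4 c5 2.Nf3 d6 3.d4 cxd4"

# Same pattern as above, written with escaped backslashes instead of a raw string
pattern <- "(\\d+)\\.(\\S+) (\\S+)"

# gregexec() finds every match and keeps the capture groups;
# regmatches() then returns one matrix per input string, one column per match
m <- regmatches(moves, gregexec(pattern, moves, perl = TRUE))[[1]]

# Row 1 holds the full match, rows 2 to 4 hold the three captured groups
one_game <- data.frame(
  move_no = as.integer(m[2, ]),
  white   = m[3, ],
  black   = m[4, ]
)
one_game
```

Because the pattern requires a space and a second token after White's move, a game that ends on White's move would lose that final half-move. The data in the question always has both halves, so this does not come up there.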

MrFlick answered Dec 02 '25 10:12


With base R:

Data

x <- "
game_id  | moves
101      | 1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7 
102      | 1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6 
"
df <- read.table(textConnection(x) , header = TRUE , sep = "|")

Using a helper function fn:


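fn <- function(df) {
  lst <- list()
  id <- 1 ; L <- 1   # id: next free slot in lst; L: index of the current game
  # Splitting on "." and " " yields repeating triples: move number, white, black
  clmn <- strsplit(trimws(df$moves) , "[. ]")
  for (i in clmn) {
    for (j in seq_len(length(i) %/% 3)) {
      j <- 3*j - 2   # position of the move number within the current triple
      # c() coerces to character, so game_id is stored as text here
      lst[[id]] <- c(df$game_id[[L]] , clmn[[L]][j:(j + 2)])
      id <- id + 1
    }
    L <- L + 1
  }
  lst
}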
#===================================

df <- data.frame(do.call(rbind , fn(df)))
colnames(df) <- c("game_id" , "move_no" , "white" , "black")

Output

df
#>    game_id move_no white black
#> 1      101       1    e4    c5
#> 2      101       2   Nf3    d6
#> 3      101       3    d4  cxd4
#> 4      101       4  Nxd4   Nf6
#> 5      101       5   Nc3   Nc6
#> 6      101       6   Bc4    e6
#> 7      101       7   Be3   Be7
#> 8      102       1    e3    c5
#> 9      102       2   Nf3   Nc6
#> 10     102       3    d4  cxd4
#> 11     102       4  Nxd4   Nf6
#> 12     102       5   Nc3    e5
#> 13     102       6  Ndb5    d6

Created on 2022-06-16 by the reprex package (v2.0.1)
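
The same split can also be done without the explicit loops. This is a minimal sketch rather than a drop-in replacement: it rebuilds the data frame dd from the question, and the names tokens, n_moves, triples and long are made up for this illustration. It assumes every move has both a White and a Black part, so that each game splits into a multiple of three tokens. Unlike fn, it keeps game_id numeric and stores move_no as an integer.

```r
# The data frame from the question
dd <- data.frame(
  game_id = c(101, 102),
  moves = c(
    "1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7",
    "1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6"
  )
)

# Split every game on "." and " ": tokens come in triples (number, white, black)
tokens <- strsplit(trimws(dd$moves), "[. ]")

# Number of full moves in each game
n_moves <- lengths(tokens) %/% 3

# Lay all triples out row by row, three columns wide
triples <- matrix(unlist(tokens), ncol = 3, byrow = TRUE)

long <- data.frame(
  game_id = rep(dd$game_id, times = n_moves),  # repeat each id once per move
  move_no = as.integer(triples[, 1]),
  white   = triples[, 2],
  black   = triples[, 3]
)
long
```

If a game could end on White's move, the token count would no longer be a multiple of three and the rows would misalign, so that case would need separate handling.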

Mohamed Desouky answered Dec 02 '25 09:12