R: How to read in a PGN as a Data Frame

Question

I have a single .pgn (Portable Game Notation) file containing a large number of chess games. The games are stored in the file like this:

[Event "4th Bayern-chI Bank Hofmann"]
[Site "?"]
[Date "2000.10.29"]
[Round "?"]
[White "Carlsen, Magnus"]
[Black "Cordts, Ingo"]
[ECO "A56"]
[WhiteElo "0"]
[BlackElo "2222"]
[Result "0-1"]

1. d4 Nf6 2. c4 c5 3. Nf3 cxd4 4. Nxd4 e5 5. Nb5 d5 6. cxd5 Bc5 7. N5c3 O-O 8. e3 e4 9. h3 Re8 10. g4 Re5 11. Bc4 Nbd7 12. Qb3 Ne8 13. Nd2 Nd6 14. Be2 Qh4 15. Nc4 Nxc4 16. Qxc4 b5 17. Qxb5 Rb8 18. Qa4 Nf6 19. Qc6 Nd7 20. d6 Re6 21. Nxe4 Bb7 22. Qxd7 Bxe4 23. Rh2 Bxd6 24. Bc4 Rd8 25. Qxa7 Bxh2 26. Bxe6 fxe6 27. Qa6 Bf3 28. Bd2 Qxh3 29. Qxe6+ Kh8 30. Qe7 Bc7 

0-1


[Event "4th Bayern-chI Bank Hofmann"]
[Site "?"]
[Date "2000.10.30"]
[Round "?"]
[White "Kaiser, Guenter"]
[Black "Carlsen, Magnus"]
[ECO "A46"]
[WhiteElo "0"]
[BlackElo "0"]
[Result "0-1"]

1. d4 Nf6 2. Nf3 d6 3. Nc3 g6 4. e4 Bg7 5. Be2 O-O 6. O-O e5 7. Be3 h6 8. Qd2 Ng4 9. d5 f5 10. exf5 gxf5 11. h3 Nxe3 12. Qxe3 e4 13. Nd4 Qe7 14. Rad1 c5 15. dxc6 bxc6 16. Bc4+ Kh7 17. Nce2 d5 18. Bb3 c5 19. Nb5 d4 20. Qd2 Bb7 21. Nf4 a6 22. Nd5 Qe5 23. Nbc7 Ra7 24. Qa5 Nd7 25. g3 Rc8 26. Nb5 Raa8 27. Nbc7 Bxd5 28. Nxa8 Rxa8 29. Ba4 Be6 30. Kh2 f4 31. Qe1 Nf6 32. Bc6 Rc8 33. Bb7 Rc7 34. Ba8 Bd5 35. Bxd5 Nxd5 36. Qe2 fxg3+ 

0-1

I would like to read in this data as a DataFrame where the column titles are simply the word to the left of the string in quotation marks and the row values are whatever is in the quotation marks. Another column would contain a string of all the moves.

I am completely new to R and simply cannot figure out how to read in a file that is not already in some known format.

readLines() looks promising.

Jota · Accepted Answer

Try this:

pgn <- read.table("your_file.pgn", quote="", sep="\n", stringsAsFactors=FALSE)

# get column names
colnms <- sub("\\[(\\w+).+", "\\1", pgn[1:12,1])
# give columns 11 (the moves) and 12 (redundant results column) nice names
colnms[11] <- "Moves"
colnms[12] <- "Results2"

pgn.df <- data.frame(matrix(sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1", pgn[,1]),
                     byrow=TRUE, ncol=12))

names(pgn.df) <- colnms

This solution assumes each game occupies exactly 12 non-blank lines, as in your example (read.table skips the blank lines by default). If games take up a variable number of lines, this solution won't work.
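For files where games do not all occupy the same number of lines, one possible approach is to group lines by game before parsing. The sketch below is not part of the original answer: the helper name parse_pgn is hypothetical, and it assumes every game starts with an [Event ...] tag and that every non-tag line belongs to the move text (so the trailing result line ends up at the end of Moves).

```r
# Hypothetical helper: parse PGN lines where games may differ in length.
# Assumes each game begins with an [Event "..."] tag line.
parse_pgn <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                   # drop blank lines
  game_id <- cumsum(grepl("^\\[Event ", lines))   # new game at every [Event
  games <- lapply(split(lines, game_id), function(g) {
    is_tag <- grepl("^\\[\\w+ \".*\"\\]$", g)
    tags <- g[is_tag]
    out <- as.list(sub("^\\[\\w+ \"(.*)\"\\]$", "\\1", tags))
    names(out) <- sub("^\\[(\\w+) .*", "\\1", tags)
    # Every non-tag line is treated as move text and joined into one string
    out$Moves <- paste(g[!is_tag], collapse = " ")
    out
  })
  # Use the union of all tag names so games with missing tags get NA
  cols <- unique(unlist(lapply(games, names)))
  rows <- lapply(games, function(g) {
    vals <- unlist(g)[cols]
    names(vals) <- cols
    vals
  })
  data.frame(do.call(rbind, rows), stringsAsFactors = FALSE, row.names = NULL)
}

# Small inline demo; the second game has no White tag
demo_lines <- c('[Event "A"]', '[White "X"]', '[Result "1-0"]', '',
                '1. e4 e5', '2. Nf3 1-0', '',
                '[Event "B"]', '[Result "0-1"]', '', '1. d4 d5 0-1')
demo_df <- parse_pgn(demo_lines)
print(demo_df)
```

With a real file, the call would be something like parse_pgn(readLines("your_file.pgn")).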

Explanation of the regex lines (see `?regex` for more):

sub("\\[(\\w+).+", "\\1", pgn[1:12,1])

In this regex, we want the first word that follows a square bracket. We have to escape that bracket, as it's a metacharacter. There are other ways to achieve that without using escapes (\\), such as by making the [ a character class by putting it inside square brackets: sub("[[](\\w+).+", "\\1", pgn[1:12,1]).

The parentheses (a capture group) go together with the \\1. The \\1 as the second argument to sub says to replace the original string with the contents of the first (and, in this case, only) capture group. Were there a second capture group, you'd use \\2 to reference it.
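To illustrate a second capture group, here is a small sketch that is not taken from the answer's code. The pattern with two groups and the "name=value" replacement format are chosen purely for illustration.

```r
# Sample tag line taken from the question
line <- "[Event \"4th Bayern-chI Bank Hofmann\"]"

# Group 1 captures the tag name, group 2 the text between the quotes
swapped <- sub("\\[(\\w+) \"(.+)\"\\]", "\\1=\\2", line)
print(swapped)
```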

The contents of the capture group \\w+ are one or more (that's what the + means) word characters (represented by \\w). After the () we want to match the rest of the string, which we can do by looking for any character (that's what . means) one or more times (i.e. .+).

So, the regex finds the first square bracket and the first consecutive block of word characters, which we capture, followed by one or more of any other characters.
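As a quick check of the first regex, the sketch below applies it to a few tag lines copied from the question, once with the escaped bracket and once with the character-class form. The variable names are only for illustration.

```r
tags <- c("[Event \"4th Bayern-chI Bank Hofmann\"]",
          "[Site \"?\"]",
          "[WhiteElo \"0\"]")

# Escaped bracket, as in the answer
escaped <- sub("\\[(\\w+).+", "\\1", tags)

# Same match with the bracket written as a character class
bracketed <- sub("[[](\\w+).+", "\\1", tags)

print(escaped)
```

Both forms should return the tag names only, which is what ends up as the column names.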

The second regex: "\\[\\w+ \\\"(.+)\\\"\\]"

Let's look at the first entry of pgn[,1]: [1] "[Event \"4th Bayern-chI Bank Hofmann\"]". We start out the same as the first regex, but this time we don't want to capture the first word, we just want to find it followed by a space, and then we want to capture everything between the two sets of \".

Both \ and " have to be escaped, so we have a pair of \\\" surrounding a capture group that looks for any character one or more times (.+), and finally we have a closing square bracket, which we escape the same way as the opening one. If we didn't escape the ", R would think that was the end of the first argument to sub, and not interpret the " as a literal quote.

For entries like lines 11 and 12 (the moves and the final result), nothing is matched because neither line contains a [, so nothing is substituted. We just get the original string back in its entirety.
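The behaviour described above can be seen on a tiny vector. This is a sketch using one tag line from the question plus a shortened move line and a result line. The vector name entries is an assumption for illustration.

```r
entries <- c("[Event \"4th Bayern-chI Bank Hofmann\"]",
             "1. d4 Nf6 2. c4 c5",
             "0-1")

# Tag lines are reduced to the quoted value; other lines pass through unchanged
result <- sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1", entries)
print(result)
```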

W7GVR · Answer

Here's what I'd try:

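con = file("pgn_file.txt", "r")
all_lines = readLines(con)
close(con)

res = list()
# Tag lines look like [Field "value"]; group 1 is the field, group 2 the value.
# Values with characters outside the class (e.g. an apostrophe) will not match.
tag_pattern = "^\\[\\s*([a-zA-Z]+)\\s*\"([a-zA-Z0-9\\s.?, -]+)\"\\]$"
for(this_line in all_lines)
  {
  if(grepl("^\\s*$", this_line, perl=TRUE))
    {
    print("Empty line: do nothing")
    }else
    {
    if(grepl("^\\[", this_line, perl=TRUE))
      {
      field = gsub(tag_pattern, "\\1", this_line, perl=TRUE)
      value = gsub(tag_pattern, "\\2", this_line, perl=TRUE)
      print(field)
      res[[tolower(field)]] = c(res[[tolower(field)]], value)
      }else
      {
      print(this_line)
      }
    # Move text is assumed to sit on a single line starting with "1."
    if(grepl("^1\\.", this_line, perl=TRUE))
      {
      res[["move_list"]] = c(res[["move_list"]], this_line)
      }
    }
  }
# Works only if every field was collected exactly once per game
res = as.data.frame(res, stringsAsFactors=FALSE)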

Tags:

r

Parseltongue
