Excessive depth in document: XML_PARSE_HUGE option for xml2::read_html() in R

First, I would like to apologize for asking a new question: my reputation does not yet allow me to comment on other people's answers, in particular on two SO posts I have seen. So please bear with this older guy :-)

I am trying to read a list of 100 character files, ranging in size from around 90 KB to 2 MB, and then use the qdap package to compute some statistics on the text I extract from them, namely counting sentences, words, etc. The files contain webpage source previously scraped with RSelenium::remoteDriver$getPageSource() and saved to disk with write(pgSource, "fileName.txt"). I am reading the files in a loop using:

pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
doc <- read_html(pgSource)

which for some files throws:

Error in eval(substitute(expr), envir, enclos) : 
  Excessive depth in document: 256 use XML_PARSE_HUGE option [1] 
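
For context, the whole loop looks roughly like this (a sketch of my setup; fPath and fileNames hold the directory and file names assumed above):

library(xml2)

fileNames <- list.files(fPath, pattern = "\\.txt$")

for (i in seq_along(fileNames)) {
  f <- file.path(fPath, fileNames[i])
  # nchars = file.size(f) reads the whole file; the fixed nchars = 1e6 above
  # would silently truncate the 2 MB files
  pgSource <- readChar(f, nchars = file.size(f))
  doc <- read_html(pgSource)   # fails with "Excessive depth in document" for some files
}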

I have seen these posts, SO33819103 and SO31419409, which point to similar problems, but I cannot fully understand how to use @shabbychef's workaround as suggested in both posts, namely the snippet posted by @glossarch in the first link above:

library(drat)
drat:::add("shabbychef")
install.packages("xml2")
library(xml2)

EDIT: I noticed that when I previously ran another script that scraped the data live from the webpages using URLs, I did not encounter this problem. The code was the same; I was simply calling doc <- read_html(pgSource) on the source obtained directly from RSelenium's remoteDriver.

What I would like to ask this gentle community is whether I am following the right steps in installing and loading xml2 after adding shabbychef's drat repository, or whether I need to add some other step, as suggested in post SO17154308. Any help or suggestions are greatly appreciated. Thank you.

asked Sep 24 '16 by salvu

1 Answer

I don't know if this is the right thing to do, but my question was answered by @hrbrmstr in one of his comments. I decided to post an answer so that people stumbling upon this question see that it has at least one answer.

The problem is essentially solved by using the "HUGE" option when reading the HTML source. My problem only occurred when I loaded previously saved source; I did not encounter it while using the "live" version of the application, i.e. reading the source directly from the website.

Anyway, the August 2016 update of the excellent xml2 package now permits the use of the HUGE option, as follows:

doc <- read_html(pageSource, options = "HUGE")
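
Applied to the file-reading loop from my question, the fix looks roughly like this (same sketch as above, with fPath and fileNames assumed from my setup):

for (i in seq_along(fileNames)) {
  f <- file.path(fPath, fileNames[i])
  pgSource <- readChar(f, nchars = file.size(f))
  # "HUGE" maps to libxml2's XML_PARSE_HUGE flag, which relaxes the parser's
  # hard-coded limits, including the nesting-depth limit that raised the error
  doc <- read_html(pgSource, options = "HUGE")
}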

For more information, please see the xml2 reference manual on CRAN: https://cran.r-project.org/package=xml2

I wish to thank @hrbrmstr again for his valuable contribution.

answered Nov 19 '22 by salvu