I am using the xml2 package to read a huge XML file into memory, and the command fails with the following error:
Error: Char 0x0 out of allowed range [9]
My code looks like the following:
library(xml2)
doc <- read_xml('~/Downloads/FBrf.xml')
The data can be downloaded at ftp://ftp.flybase.net/releases/FB2015_05/reporting-xml/FBrf.xml.gz (about 140 MB compressed; roughly 1.8 GB unpacked).
Does anyone have advice on how to identify the problematic characters, or on how to clean the file before reading it?
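For reference, one way to check whether the file actually contains NUL (0x00) bytes, and where they sit, is a byte-level scan outside of R. This is a minimal sketch, assuming GNU grep with `-P` support; the sample file path is made up for the demo:

```shell
# Create a tiny sample file with one embedded NUL to demonstrate the scan
printf 'ab\0cd' > /tmp/sample.xml

# Count lines containing a NUL byte (-a treats binary data as text)
grep -c -a -P '\x00' /tmp/sample.xml

# Print the byte offset of each NUL (here: offset 2)
grep -b -o -a -P '\x00' /tmp/sample.xml | cut -d: -f1
```

On the real file you would point this at ~/Downloads/FBrf.xml instead of the sample path.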
EDIT
OK, since the file is pretty big, I searched for other solutions on Stack Overflow and tried to implement a solution from Martin Morgan, which he presented in Combine values in huge XML-files.
So far I have written the following lines of code:
library(XML)

branchFunction <- function(progress = 10) {
  res <- new.env(parent = emptyenv())  # for results
  it <- 0L                             # iterator -- nodes visited
  list(publication = function(elt) {
    ## handle 'publication' nodes
    if (getNodeSet(elt, "not(/publication/feature/id)"))
      ## early exit -- no feature id
      return(NULL)
    it <<- it + 1L
    if (it %% progress == 0L)
      message(it)
    publication <- getNodeSet(elt, "string(/publication/id/text())")  # 'key'
    res[[publication]] <-
      list(miniref = getNodeSet(elt,
                                "normalize-space(/publication/miniref/text())"),
           features = xpathSApply(elt, "//feature/id/text()", xmlValue))
  }, getres = function() {
    ## retrieve the 'res' environment when done
    res
  }, get = function() {
    ## retrieve the 'res' environment as a data.frame
    publication <- ls(res)
    miniref <- unlist(eapply(res, "[[", "miniref"), use.names = FALSE)
    feature <- eapply(res, "[[", "features")
    len <- sapply(feature, length)
    data.frame(publication = rep(publication, len),
               feature = unlist(feature, use.names = FALSE),
               miniref = rep(miniref, len))
  })
}
branches <- branchFunction()
xmlEventParse("~/Downloads/jnk.xml", handlers=NULL, branches=branches)
# xmlEventParse("~/Downloads/FBrf.xml", handlers=NULL, branches=branches)
branches$get()
I uploaded the XML file to my server: http://download.dejung.net/jnk.xml
The file is only a few kB, but it reproduces the problem. The second publication entry has the id FBrf0162243 and a miniref of Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886.
The code I posted above reports the wrong publication id for the corresponding miniref; the feature ids are correct:
FBrf0050934 FBgn0003277 Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886
I am not sure why my code is reporting the wrong values; maybe someone can help me with the closures, since they are very new to me.
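For what it's worth, the closure pattern the branch function relies on can be reduced to a few lines. This is a minimal sketch (the makeCounter name is made up for illustration), showing how a list of functions shares state through their enclosing environment via `<<-` -- the same mechanism branchFunction uses for res and it:

```r
makeCounter <- function() {
  it <- 0L                      # lives in the enclosing environment
  list(
    bump = function() {
      it <<- it + 1L            # '<<-' updates 'it' in the enclosing environment
      invisible(it)
    },
    get = function() it         # reads the shared 'it'
  )
}

cnt <- makeCounter()
cnt$bump()
cnt$bump()
cnt$get()   # returns 2L
```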
I occasionally encounter "embedded NULL" error messages that may be similar to this (if the 0x0 in your message refers to the same NULL issue). My approach is to delete those characters before reading in the file, as I have not found an R package that ignores them.
If you are on Unix or OS X, you could invoke sed from your R program via:
system( 'sed "s/\\0//g" ~/Downloads/dirty.xml > ~/Downloads/clean.xml' )
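One caveat (an assumption about sed worth verifying on your system): sed is line-oriented and some builds do not reliably match the NUL byte, so tr is often the more robust tool for this particular character:

```shell
# Demonstrate stripping NUL bytes with tr instead of sed
printf 'a\0b' > /tmp/dirty.xml
tr -d '\000' < /tmp/dirty.xml > /tmp/clean.xml
cat /tmp/clean.xml   # prints "ab"
```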
If this doesn't do the trick, you might want to expand this "blacklist" of characters -- see for example Unicode Regex; Invalid XML characters.
If something is still wrong, I sometimes switch to a character whitelist -- delete everything not in the specified character set:
sed 's/[^A-Za-z0-9 _.,"]//g' ~/Downloads/dirty.csv > ~/Downloads/clean.csv
This is the one I use for .csv data files (where I don't care about markup like </etc.>), so for XML you would want to expand the character class, for example to [^[:ascii:]].
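An equivalent whitelist can also be expressed with tr, keeping tab, LF, CR, and printable ASCII while deleting everything else -- a sketch, assuming a POSIX tr (the sample file contents are made up for the demo):

```shell
# Keep tab (\11), LF (\12), CR (\15) and printable ASCII (\40-\176); delete the rest
printf 'ok\0\303\251!' > /tmp/dirty.xml            # 'ok', a NUL, a non-ASCII byte pair, '!'
tr -cd '\11\12\15\40-\176' < /tmp/dirty.xml > /tmp/clean.xml
cat /tmp/clean.xml   # prints "ok!"
```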
If you are on Windows, you will likely have to go outside of R for this approach -- for example, you can use Cygwin instead of the system() invocation above.