 

Problems reading a big XML file with the xml2 package and trying to create a working closure

Tags:

r

xml

xml2

I am using the xml2 package to read a huge XML file into memory and the command fails with the following error:

Error: Char 0x0 out of allowed range [9]

My code looks like the following:

library(xml2)
doc <- read_xml('~/Downloads/FBrf.xml')

The data can be downloaded from ftp://ftp.flybase.net/releases/FB2015_05/reporting-xml/FBrf.xml.gz (about 140 MB); unpacked, the file is about 1.8 GB.

Does anyone have advice on how to figure out which characters are problematic, or how to clean the file before reading it?

EDIT

OK, since the file is pretty big, I searched for other solutions on Stack Overflow and tried to implement a solution from Martin Morgan, which he presented in "Combine values in huge XML-files".

This is the code I have so far:

library(XML)
branchFunction <- function(progress=10) {
    res <- new.env(parent=emptyenv())   # for results
    it <- 0L                            # iterator -- nodes visited
    list(publication=function(elt) {
        ## handle 'publication' nodes 
        if (getNodeSet(elt, "not(/publication/feature/id)"))
            ## early exit -- no feature id
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)
        publication <- getNodeSet(elt, "string(/publication/id/text())") # 'key'
        res[[publication]] <-
            list(miniref=getNodeSet(elt,
                   "normalize-space(/publication/miniref/text())"),
                 features= xpathSApply(elt, "//feature/id/text()", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get=function() {
        ## retrieve 'res' environment as data.frame
        publication <- ls(res)
        miniref <- unlist(eapply(res, "[[", "miniref"), use.names=FALSE)
        feature <- eapply(res, "[[", "features")
        len <- sapply(feature, length)
        data.frame(publication=rep(publication, len),
                   feature=unlist(feature, use.names=FALSE), 
                   miniref=rep(miniref, len))
    })
}

branches <- branchFunction()
xmlEventParse("~/Downloads/jnk.xml", handlers=NULL, branches=branches)
# xmlEventParse("~/Downloads/FBrf.xml", handlers=NULL, branches=branches)
branches$get()

I uploaded a small sample XML file to my server: http://download.dejung.net/jnk.xml

The file is only a few kB, but the result is wrong. The second publication entry has the id FBrf0162243 and the miniref Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886.

The code I posted above reports the wrong publication id for this miniref, although the feature ids are correct:

FBrf0050934 FBgn0003277 Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886

I am not sure why my code reports the wrong values. Maybe someone can help me with the closures, since they are very new to me.

drmariod asked Feb 02 '16 14:02



1 Answer

I occasionally encounter "embedded nul" error messages that may be related to this (assuming the 0x0 in this message refers to the same NUL character). My approach is to delete these characters before reading in the file, as I have not found an R package that ignores them.
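Before cleaning anything, it may help to confirm that the file really contains NUL bytes and where they sit. The following is a minimal base R sketch, not a tested solution for the FlyBase file: the function name find_nul_bytes and its parameters are made up for illustration, and it reads the file in chunks so that a file of this size never has to fit in memory at once.

```r
# Hypothetical helper: return the 1-based byte positions of NUL (0x00) bytes
# in a file, scanning it in fixed-size chunks.
find_nul_bytes <- function(path, chunk_size = 2^20, max_hits = 20) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offset <- 0          # number of bytes read before the current chunk
  hits <- numeric(0)   # positions found so far (double, to avoid integer overflow)
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break            # end of file
    pos <- which(chunk == as.raw(0))          # NUL positions inside this chunk
    hits <- c(hits, offset + pos)
    if (length(hits) >= max_hits) break       # stop early once enough were found
    offset <- offset + length(chunk)
  }
  head(hits, max_hits)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".xml")
writeBin(c(charToRaw("<a>x"), as.raw(0), charToRaw("y</a>")), tmp)
find_nul_bytes(tmp)   # the NUL is the 5th byte of this demo file
```

An empty result would suggest that the parser is complaining about something other than literal NUL bytes.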

If you are on Unix or OS X, you could invoke sed in your R program via:

system( 'sed "s/\\0//g" ~/Downloads/dirty.xml > ~/Downloads/clean.xml' )

If this doesn't do the trick, you might want to expand this "blacklist" of characters; see, for example, "Unicode Regex; Invalid XML characters".

If something is still wrong, I sometimes use a character whitelist instead, deleting everything that is not in a specified character set:

sed 's/[^A-Za-z0-9 _.,"]//g' ~/Downloads/dirty.csv > ~/Downloads/clean.csv

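This is the one I use for .csv data files, where characters such as < and > don't matter, so for XML you'd want to expand it, for example to keep every ASCII character. Note that [:ascii:] is a Perl/PCRE character class rather than a POSIX one, so sed is unlikely to accept [^[:ascii:]]; a POSIX bracket expression such as [^[:print:][:space:]] is a closer fit, although what counts as printable depends on the locale.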

If you are on Windows, you will likely have to go outside of R for this approach; for example, you can run sed from Cygwin instead of using the system() invocation above.
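The NUL-stripping step itself does not strictly need sed. As an alternative sketch that stays inside base R, and so should behave the same on Windows, the file can be copied chunk by chunk while dropping 0x00 bytes. The function name strip_nul_bytes and the chunk size are assumptions made for illustration; this has not been run against the 1.8 GB FlyBase file.

```r
# Hypothetical helper: copy `infile` to `outfile`, dropping every NUL (0x00) byte.
# Works on raw bytes in chunks, so the whole file is never held in memory.
strip_nul_bytes <- function(infile, outfile, chunk_size = 2^20) {
  inp <- file(infile, open = "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(outfile, open = "wb")
  on.exit(close(out), add = TRUE)
  removed <- 0                                 # count of NUL bytes dropped
  repeat {
    chunk <- readBin(inp, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break             # end of file
    keep <- chunk != as.raw(0)
    removed <- removed + sum(!keep)
    writeBin(chunk[keep], out)
  }
  invisible(removed)
}

# Small self-contained demonstration on temporary files
dirty <- tempfile(fileext = ".xml")
clean <- tempfile(fileext = ".xml")
writeBin(c(charToRaw("<a>x"), as.raw(0), charToRaw("y</a>")), dirty)
strip_nul_bytes(dirty, clean)
readLines(clean, warn = FALSE)
# The cleaned file could then be passed to the parser, e.g.:
# doc <- xml2::read_xml(clean)
```

Because only the single byte 0x00 is removed, multi-byte UTF-8 characters elsewhere in the file are left untouched, which a strict ASCII whitelist would not guarantee.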

C8H10N4O2 answered Nov 15 '22 09:11
