
How can I lazily parse big XHTML file in Clojure?

I have a valid XHTML file (100 megabytes of data) containing one large table. The first tr holds the column names (for a database); all the other tr's are data. It is the only table in the whole document, and it sits at html->body->div->table.

How can I parse it lazily in Clojure?

I know about data.xml, but because I am a Clojure beginner I find it very difficult to get working, especially because the REPL is very slow when working with such a big file.

Jiri Knesl asked Jan 15 '13 08:01



1 Answer

According to the data.xml docs, parse creates a lazy tree of the document. I checked locally and it seems to be true:

; Load libs
(require '[clojure.data.xml :as xml])
(require '[clojure.java.io :as io])

; standard.xml is a 100MB XML file from http://www.xml-benchmark.org/downloads.html
(def xml-tree (xml/parse (io/reader "standard.xml")))
(:tag xml-tree) ; => :site

(def child (first (:content xml-tree)))
(:tag child) ; => :regions

; Forcing every top-level child parses the whole file:
; the REPL hangs for ~30 seconds on my computer.
(dorun (:content xml-tree))
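
Applied to the table from the question, the same lazy tree can be walked one tr at a time. The following is only a sketch under assumptions: the table really sits at html->body->div->table with tr elements directly inside it (no tbody), each cell contains plain text, and the helper names (child-elements, cell-text, row-cells, table-rows, each-record) are made up for illustration and are not part of data.xml. Tags are compared by name so that an XHTML namespace on the tag keyword does not matter.

```clojure
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

(defn child-elements
  "Lazy seq of the direct children of `element` whose tag name is `tag-name`.
  Text nodes are plain strings, have no :tag, and are skipped."
  [element tag-name]
  (filter #(some-> % :tag name (= tag-name)) (:content element)))

(defn cell-text
  "Joins the text nodes directly inside a cell (assumes no nested markup)."
  [cell]
  (apply str (filter string? (:content cell))))

(defn row-cells
  "Vector of cell texts for one tr; covers both th and td children."
  [tr]
  (mapv cell-text (filter :tag (:content tr))))

(defn table-rows
  "Lazy seq of rows under html->body->div->table (assumed layout)."
  [root]
  (let [table (-> root
                  (child-elements "body") first
                  (child-elements "div") first
                  (child-elements "table") first)]
    (map row-cells (child-elements table "tr"))))

(defn each-record
  "Calls `f` with one map per data row, keyed by the header row.
  `source` is anything io/reader accepts (a file name, a Reader, ...).
  Rows are consumed inside with-open, before the reader is closed."
  [source f]
  (with-open [rdr (io/reader source)]
    (let [rows   (table-rows (xml/parse rdr))
          header (map keyword (first rows))]
      (doseq [row (rest rows)]
        (f (zipmap header row))))))
```

For example, (each-record "table.xhtml" println) would print one map per row, where "table.xhtml" is a placeholder file name. Note that the root is not stored in a def here: a var such as xml-tree above keeps every node that has already been realized reachable, so a fully traversed tree would stay in memory.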
Mikita Belahlazau answered Nov 18 '22 11:11
