
How to efficiently replace characters in XML document in Java?

I'm looking for a neat and efficient way to replace characters in an XML document. A replacement table is defined for almost 12,000 UTF-8 characters. Most of them are to be replaced by a single character, but some must be replaced by two or even three characters (e.g. Greek theta should become TH). The documents can be bulky (100MB+). How can I do this in Java? I came up with the idea of using XSLT, but I'm not sure it is the best option.

Tomasz Błachowicz asked Oct 31 '25 00:10


2 Answers

In my experience, String.replace(..) is very slow. I used that API while processing 100MB KML files and the performance was poor. I then pre-compiled the regular expression with Pattern.compile(..), and that ran a whole lot faster.
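A minimal sketch of the pre-compiled pattern idea, assuming the replacement table is available as a Map. The class name, the two-entry table, and the "anything outside ASCII" pattern are illustrative assumptions, not something from the question. Note that it works on plain strings; run over raw XML text it would also touch tag names and attribute values.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledReplace {
    // Hypothetical two-entry table; the real one would hold the ~12,000 mappings.
    static final Map<String, String> TABLE = Map.of("\u03B8", "TH", "\u00E9", "e");

    // Compiled once and reused for every input, instead of one String.replace call per table entry.
    static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    static String translate(String input) {
        Matcher m = NON_ASCII.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        while (m.find()) {
            String hit = m.group();
            // Characters without a table entry are copied through unchanged.
            String rep = TABLE.getOrDefault(hit, hit);
            // quoteReplacement keeps '$' and '\' in a replacement from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(rep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The input is scanned once and each match is a single map lookup, so the cost does not grow with the number of table entries. The StringBuilder overload of appendReplacement needs Java 9 or later.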

limc answered Nov 01 '25 13:11


Have a look at SAX, which lets you see each individual part of the XML document as it passes by. You can then act on the text nodes and do the manipulation you need.
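The SAX approach could look roughly like the sketch below. It assumes a one-entry table keyed by char (so only BMP characters), and it uses an XMLFilterImpl that rewrites text nodes and passes everything else through. The class and method names are made up for illustration. A JAXP identity Transformer is used here only as a serializer; as far as I know the JDK's built-in implementation forwards the SAX events to the output without building a tree, but treat that as an assumption and measure memory on a real file.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class SaxCharReplacer {
    // Hypothetical one-entry table standing in for the real ~12,000 mappings.
    static final Map<Character, String> TABLE = Map.of('\u03B8', "TH");

    static void transform(InputStream in, OutputStream out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        // The filter sits between the parser and the serializer and only rewrites text nodes;
        // element names and attribute values pass through untouched.
        XMLFilterImpl filter = new XMLFilterImpl(parser) {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                StringBuilder sb = new StringBuilder(length);
                for (int i = start; i < start + length; i++) {
                    String rep = TABLE.get(ch[i]);
                    if (rep != null) {
                        sb.append(rep);
                    } else {
                        sb.append(ch[i]);
                    }
                }
                char[] mapped = sb.toString().toCharArray();
                super.characters(mapped, 0, mapped.length);
            }
        };

        // A Transformer created without a stylesheet copies its input to the output unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }
}
```

For a large document you would call transform with a buffered FileInputStream and FileOutputStream. Because the lookup is per char, it does not matter that the parser may deliver one text node in several characters() calls; a table that includes characters outside the BMP would need extra care with surrogate pairs split across calls.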

The problem with XSLT is that most implementations need the whole input tree in memory, which is typically 10 times the size on disk. The only implementation I know of that can do streaming XSLT is the commercial edition of the Saxon XSLT transformer (but that would be perfect for your needs).

Thorbjørn Ravn Andersen answered Nov 01 '25 15:11


