
Java regex alternative for bytes on stream

I have XML files (encoded in UTF-8) that have two issues:

  • Some of them (not all) start with a UTF-8 byte order mark (EF BB BF).

  • Some of them (not all) contain null characters (00), distributed over the whole file.

Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use a regex to remove these characters, and write the string back to a file, which works fine. However, my files are quite large (hundreds of megabytes), and reading the file into a String and creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I will need to stream the file and strip these characters on the fly.

Any suggestions on what an efficient solution should look like?

Will asked Jun 30 '26 13:06


2 Answers

I would subclass FilterInputStream to filter out the undesired bytes at runtime.

The task should be fairly easy: a byte order mark is normally only at the start of the file (so you only need to check there), and null bytes can be filtered out with a simple == comparison (no need for regex-like features).

This will most likely also increase performance as you don't need to write out the full corrected file to disk before re-reading it.
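A minimal sketch of such a filter, in case it helps. The class name CleaningInputStream is made up for illustration, and the sketch assumes the BOM can only occur at the very start of the stream. It wraps the source in a PushbackInputStream so the first three bytes can be inspected and put back if they are not a BOM:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical filter: skips a leading UTF-8 BOM and drops every 0x00 byte. */
class CleaningInputStream extends FilterInputStream {
    private boolean bomChecked = false;

    CleaningInputStream(InputStream source) {
        // A pushback buffer of 3 bytes is enough to peek at a possible BOM.
        super(new PushbackInputStream(source, 3));
    }

    /** Reads up to three bytes; pushes them back unless they are EF BB BF. */
    private void skipBom() throws IOException {
        bomChecked = true;
        PushbackInputStream pb = (PushbackInputStream) in;
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) break;          // stream shorter than three bytes
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) pb.unread(head, 0, n);
    }

    @Override
    public int read() throws IOException {
        if (!bomChecked) skipBom();
        int b;
        do { b = in.read(); } while (b == 0);   // skip null bytes
        return b;                               // -1 at end of stream
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (!bomChecked) skipBom();
        if (len == 0) return 0;
        while (true) {
            int n = in.read(buf, off, len);
            if (n < 0) return -1;
            int kept = 0;
            // Compact the chunk in place, keeping only non-null bytes.
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) buf[off + kept++] = b;
            }
            if (kept > 0) return kept;
            // The whole chunk was null bytes: read again rather than return 0.
        }
    }
}
```

Memory use stays at the size of the caller's buffer, regardless of file size. Note that skip() and available() are not overridden here, so their results still refer to the unfiltered stream.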

Joachim Sauer answered Jul 03 '26 01:07


Why don't you filter the data as you read it into the SAX parser? That way you won't need to rewrite the file. You can override the read() methods of FilterInputStream to drop the bytes you don't want.

I think that is what @Joachim is suggesting. ;)
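Roughly, feeding the filtered stream straight into the parser could look like the sketch below. FilteredSaxParse, dropNulls and countElements are hypothetical names, the handler just counts elements as a stand-in for real processing, and for brevity only the null bytes are dropped here (a BOM check on the first three bytes would go into the same filter):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FilteredSaxParse {

    /** Hypothetical helper: drops 0x00 bytes while the parser pulls data. */
    static InputStream dropNulls(InputStream source) {
        return new FilterInputStream(source) {
            @Override
            public int read() throws IOException {
                int b;
                do { b = in.read(); } while (b == 0);
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) return 0;
                int kept = 0;
                while (kept == 0) {            // retry if a chunk was all nulls
                    int n = in.read(buf, off, len);
                    if (n < 0) return -1;
                    for (int i = 0; i < n; i++) {
                        if (buf[off + i] != 0) buf[off + kept++] = buf[off + i];
                    }
                }
                return kept;
            }
        };
    }

    /** Parses the filtered stream directly; no intermediate file or String. */
    static int countElements(InputStream xml) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(dropNulls(xml), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }
}
```

For a file on disk, the argument would typically be a BufferedInputStream around a FileInputStream, so the file is read once and never held in memory as a whole.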

Peter Lawrey answered Jul 03 '26 03:07