 

Searching for regex patterns in a 30GB XML dataset, making use of 16GB of memory

Tags: java, xml

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.

Presently it is:

  • reading each XML node,
  • storing it in a string object,
  • running some regexes on the string, and
  • storing the results to the database,

for several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
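
For concreteness, the handler is roughly this shape (the element name, regex, and database call are hypothetical placeholders, not my real code):

```java
// Rough shape of the current pipeline: buffer each element's text,
// run a regex over it, store matches. All names here are placeholders.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ExtractHandler extends DefaultHandler {
    private static final Pattern P = Pattern.compile("id=(\\d+)"); // placeholder regex
    private final StringBuilder text = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // reset the buffer for each element
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String uri, String local, String qName) {
        Matcher m = P.matcher(text);
        if (m.find()) storeInDb(m.group(1)); // stand-in for the JDBC insert
    }

    private void storeInDb(String value) { /* JDBC insert goes here */ }

    public static void main(String[] args) throws Exception {
        javax.xml.parsers.SAXParserFactory.newInstance().newSAXParser()
            .parse(new java.io.File("data.xml"), new ExtractHandler());
    }
}
```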

Is there a simple way to dynamically 'buffer' about 10GB worth of data from the input file?

I suspect I could manually write a producer/consumer multithreaded version of this (loading the objects on one side, using and discarding them on the other), but dammit, XML is ancient now; are there no efficient libraries to crunch 'em?
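
Something like this is what I have in mind for the producer/consumer version, if it comes to that (a sketch only; the queue capacity and sentinel value are arbitrary):

```java
// Producer/consumer sketch: the SAX thread puts extracted element text
// on a bounded queue; a worker takes items, runs the regexes, and writes
// to the DB. Capacity and sentinel are arbitrary choices.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {
    private static final String POISON = "\u0000EOF"; // sentinel that stops the consumer

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread consumer = new Thread(() -> {
            try {
                for (String chunk; !(chunk = queue.take()).equals(POISON); ) {
                    process(chunk); // regexes + DB write happen here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // In the real thing, the SAX endElement() callback would call queue.put(text).
        queue.put("<example element text>");
        queue.put(POISON);
        consumer.join();
    }

    private static void process(String chunk) { /* regex + JDBC here */ }
}
```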

asked Oct 29 '25 by Achille

2 Answers

  1. Just to cover the bases, is Java actually able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
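
    A quick sanity check that the setting took effect (nothing project-specific here):

    ```java
    // Prints the maximum heap the JVM will grow to; run with -Xmx10g
    // and expect a number close to 10 GB.
    public class HeapCheck {
        public static void main(String[] args) {
            long max = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap: %.1f GB%n", max / (1024.0 * 1024 * 1024));
        }
    }
    ```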

  2. It is highly unlikely that memory is the limiting factor for what you're doing, so you really shouldn't expect to see it fully utilized. You should be either IO-bound or CPU-bound, and most likely it's IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; beyond that, the only thing you can do is buy a faster hard drive.
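
    The buffering itself is one line; a sketch (the 1 MiB buffer size is an arbitrary example):

    ```java
    // Feed the SAX parser through a large BufferedInputStream so reads
    // hit the disk in big sequential chunks.
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class BufferedParse {
        public static void main(String[] args) throws Exception {
            try (BufferedInputStream in =
                     new BufferedInputStream(new FileInputStream("data.xml"), 1 << 20)) {
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), new DefaultHandler() { /* callbacks */ });
            }
        }
    }
    ```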

  3. If you really are CPU-bound, it's possible that you're bottlenecked on the regexes rather than on the XML parsing itself.

    See this (which references this)
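
    One common cause is re-compiling the pattern for every element; the cheap fix, sketched here with a placeholder pattern:

    ```java
    // Compile the pattern once and reuse it across millions of elements,
    // instead of calling Pattern.compile() per element.
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTips {
        private static final Pattern ID = Pattern.compile("id=(\\d+)"); // placeholder

        static String extract(String element) {
            Matcher m = ID.matcher(element);
            return m.find() ? m.group(1) : null;
        }
    }
    ```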

  4. If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:

    • StAX (there are multiple implementations; Woodstox is one of the fastest)
    • Javolution
    • Roll your own using JFlex
    • Roll your own ad hoc, e.g. using regex

    For the last two, the more constrained your XML subset is, the more efficient you can make it.
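
    A cursor-style StAX scan might look like this ("record" is a hypothetical element name; with Woodstox on the classpath, the standard factory usually picks it up automatically):

    ```java
    // Pull-parse with the StAX cursor API; only the elements you care
    // about are materialized as strings.
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxScan {
        public static void main(String[] args) throws Exception {
            XMLInputFactory f = XMLInputFactory.newInstance();
            try (BufferedInputStream in =
                     new BufferedInputStream(new FileInputStream("data.xml"), 1 << 20)) {
                XMLStreamReader r = f.createXMLStreamReader(in);
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && r.getLocalName().equals("record")) {
                        // assumes <record> holds only text, no child elements
                        String text = r.getElementText();
                        // run the regexes on `text`, write hits to the DB
                    }
                }
                r.close();
            }
        }
    }
    ```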

  5. It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.

answered Nov 01 '25 by ykaganovich


First, try to find out what's slowing you down.

  • How much faster is the parser when you parse from memory?
  • Does using a BufferedInputStream with a large size help?
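
A rough harness for both questions (a sketch; "sample.xml" stands for a slice of the data small enough to hold in RAM):

```java
// Parse the same sample once from memory and once from disk, and compare.
// Caveat: the OS file cache can blur the disk number; use a cold cache
// or a separate file for a cleaner test.
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseBench {
    public static void main(String[] args) throws Exception {
        byte[] sample = Files.readAllBytes(Paths.get("sample.xml"));

        SAXParser p1 = SAXParserFactory.newInstance().newSAXParser();
        long t0 = System.nanoTime();
        p1.parse(new ByteArrayInputStream(sample), new DefaultHandler());
        long memMs = (System.nanoTime() - t0) / 1_000_000;

        SAXParser p2 = SAXParserFactory.newInstance().newSAXParser();
        long t1 = System.nanoTime();
        p2.parse(new BufferedInputStream(new FileInputStream("sample.xml"), 1 << 20),
                 new DefaultHandler());
        long diskMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("from memory: %d ms, from disk: %d ms%n", memMs, diskMs);
    }
}
```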

Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data takes time, since you have to read it all from the hard drive first, so you are always limited by disk speed. Can you distribute the load across several machines, maybe by using something like Hadoop?

answered Nov 01 '25 by Torsten Marek