 

How to use Linux csplit to chop up a massive XML file?

I have a gigantic (4GB) XML file that I am currently breaking into chunks with the Linux "split" command (every 25,000 lines, not by bytes). This usually works great (I end up with about 50 files), except that some of the description fields contain line breaks, so the chunk files frequently end without the proper closing tags, and my parser chokes halfway through processing.

Example file: (note: normally each "listing" xml node is supposed to be on its own line)

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks 
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>

Then sometimes my split output ends up like:

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks ... 
EOF

So I have been reading about "csplit" and it sounds like it might solve this issue, but I can't seem to get the regular expression right...

Basically I want the same output of roughly 50 files.

Something like:

csplit -k myfile.xml '/</listing>/' 25000 {50}

Any help would be great. Thanks!
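Edit: here is roughly what I am trying to achieve, sketched with awk instead of csplit, since awk can rotate output files only on lines that end a record. The chunk file names, the max=1 demo value (25,000 for the real file), and re-adding the prolog/wrapper tags to each chunk are just my illustrative choices:

```shell
# Sketch only: rotate output files at </listing> boundaries instead of raw
# line counts. Demo uses max=1 record per chunk so the tiny sample rotates;
# use max=25000 on the real file.
cd "$(mktemp -d)"
cat > myfile.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>no line breaks</desc></listing>
<listing><date>2009-09-22</date><desc>a really
annoying description field
WITH line breaks</desc></listing>
</listings>
EOF
awk -v max=1 '
  /^<\?xml/ || /^<listings>/ || /^<\/listings>/ { next }  # drop original wrapper
  {
    if (out == "") {                                      # lazily start a chunk
      out = sprintf("chunk%03d.xml", chunk++)
      print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" > out
      print "<listings>" > out
    }
    print > out
    if (/<\/listing>$/ && ++count >= max) {               # rotate only at a record end
      print "</listings>" > out
      close(out); out = ""; count = 0
    }
  }
  END { if (out != "") print "</listings>" > out }
' myfile.xml
ls chunk*.xml
```

This keeps the multi-line listing together in one chunk and closes every chunk with </listings>, so each file parses on its own.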

Fred asked May 13 '10 22:05


3 Answers

You can't get a valid XML file this way. I would recommend that you write a Java program using StAX, which, if you use the Woodstox implementation, will stream the XML in and out really quite fast.

bmargulies answered Nov 14 '22 23:11


I would recommend against trying to use regexps (or naive text matching) for any XML manipulation, including splitting. XML is tricky enough that a real parser should be used, and, due to memory limitations, one that can do "streaming" (a.k.a. incremental / chunked) parsing. I am most familiar with Java, where you would use a StAX (or SAX) parser and writer/generator to do this; most other languages have something similar. Or, if the input is regular enough, a data-binding tool (JAXB) that can bind subtrees.

Doing it the right way may be a bit more work, but it would actually work, handling the things XML can contain (for example, CDATA sections cannot be split; regexp solutions invariably have cases they don't handle, until one has basically written a full XML parser).
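To make the streaming idea concrete without writing Java, here is a sketch that drives Python's stdlib iterparse from the shell (an analog of the StAX suggestion, not the answerer's code; the part file names and the MAX=1 demo value are my assumptions, with 25,000 for the real file):

```shell
# Streaming-parser sketch: each <listing> element is parsed, re-serialized
# into the current chunk, then discarded, so memory stays small.
cd "$(mktemp -d)"
cat > myfile.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>one</desc></listing>
<listing><date>2009-09-22</date><desc>two
with a line break</desc></listing>
<listing><date>2009-09-22</date><desc>three</desc></listing>
</listings>
EOF
python3 - <<'PY'
import xml.etree.ElementTree as ET

MAX = 1            # listings per chunk; 25000 for the real file
out = None
chunk = count = 0

def open_chunk():
    global out, chunk
    out = open("part%03d.xml" % chunk, "w", encoding="utf-8")
    chunk += 1
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<listings>\n')

for event, elem in ET.iterparse("myfile.xml"):   # fires on each closing tag
    if elem.tag != "listing":
        continue
    if out is None:
        open_chunk()
    out.write(ET.tostring(elem, encoding="unicode").rstrip() + "\n")
    elem.clear()                                 # drop the element's content
    count += 1
    if count >= MAX:                             # rotate at a record boundary
        out.write("</listings>\n"); out.close(); out = None; count = 0
if out is not None:
    out.write("</listings>\n"); out.close()
PY
ls part*.xml
```

Because the parser, not a regex, decides where records end, line breaks (and CDATA, entities, etc.) inside descriptions cannot break the chunk boundaries.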

StaxMan answered Nov 15 '22 01:11


Use Perl (with -n rather than -p -i, so the 4GB input file is read but never rewritten):

perl -ne 'unless(defined $fname){$fname="xx00"; open $fh, ">", $fname} $size += length; print $fh $_; if($size > %MAX% and m@</listing>@){$fname++; $size = 0; open $fh, ">", $fname}' myfile.xml

Replace %MAX% with the maximum size of one file in bytes. Because the rotation check only fires on lines matching </listing>, multi-line records are never split across files.

ZyX answered Nov 15 '22 00:11