
Splitting large XML files into manageable sections for Hadoop

Tags:

xml

hadoop

Is there an input class for Hadoop that handles [multiple] large XML files based on their tree structure? I have a set of XML files that all share the same schema, and I need to split them into sections along element boundaries rather than breaking individual sections apart.

For example the XML file would be:

<root>
  <parent> data </parent>
  <parent> more data</parent>
  <parent> even more data</parent>
</root>

I would define each section as: /root/parent.

What I'm asking is: Is there a record input reader already included for Hadoop to do this?

monksy asked Nov 14 '22 12:11

1 Answer

I think the Cloud9 project at UMD might help you with this.

The library provides an XMLInputFormat class which might be of use.

Also of interest is this page in the Cloud9 documentation which looks at how you can deal with an XML dump of Wikipedia in MapReduce.
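To illustrate the idea behind such an input format, here is a minimal sketch (not the Cloud9 API, and not a real Hadoop RecordReader) of the per-record splitting logic: scan the input for a configured start tag and emit everything up to the matching end tag as one record. A real Hadoop reader does this over byte-offset file splits; the function names and tags below are just for illustration.

```python
def xml_records(text, start_tag, end_tag):
    """Yield each substring from start_tag through end_tag as one record."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end

doc = """<root>
  <parent> data </parent>
  <parent> more data</parent>
  <parent> even more data</parent>
</root>"""

# Each mapper would then receive one <parent>...</parent> element as its record.
records = list(xml_records(doc, "<parent>", "</parent>"))
```

Applied to the example XML from the question, this yields three records, one per `/root/parent` section.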

Binary Nerd answered Feb 26 '23 01:02