How to read XML files with the Apache Spark framework?

I came across a mini tutorial on data preprocessing with Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html

However, it only covers parsing text files. Is there a way to parse XML files in Spark?

Anitha asked Nov 26 '13 18:11




2 Answers

It looks like somebody has made an XML data source for Apache Spark:

https://github.com/databricks/spark-xml

It reads XML files by treating a tag you specify as one row, and it infers the column types, e.g.:

import org.apache.spark.sql.SQLContext

// `sc` is the SparkContext that spark-shell provides.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "book") // each <book> element becomes one row
    .load("books.xml")

You can also use it with spark-shell as below:

$ bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0
BoMi Kim answered Sep 21 '22 15:09



I have not used it myself, but the approach would be the same as in Hadoop. For example, you can use StreamXmlRecordReader to process the XML files. You need a record reader because you want to control the record boundaries, so that each XML element is handed to you as one record. Otherwise the default, LineRecordReader, is used and the input is processed line by line. It would help to become more familiar with the concept of a RecordReader in Hadoop.
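To make the point about record boundaries concrete, here is a minimal plain-Scala sketch. It is not Hadoop code and not how StreamXmlRecordReader is implemented; it only illustrates the difference between cutting input at line breaks and cutting it at begin/end tags. The names `RecordBoundaries` and `splitRecords` are hypothetical, chosen for this illustration.

```scala
// Illustration only: cut a string into records delimited by a begin and an end marker,
// which is roughly what an XML-aware record reader does instead of splitting on newlines.
object RecordBoundaries {
  def splitRecords(text: String, begin: String, end: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = text.indexOf(begin, from)
      if (start < 0) acc.reverse // no further begin marker
      else {
        val stop = text.indexOf(end, start + begin.length)
        if (stop < 0) acc.reverse // unterminated record is dropped
        else loop(stop + end.length, text.substring(start, stop + end.length) :: acc)
      }
    }
    loop(0, Nil)
  }
}

val xml = "<books>\n<book>\n<title>A</title>\n</book>\n<book>\n<title>B</title>\n</book>\n</books>"

// Line-based splitting yields fragments that are not meaningful on their own.
val lines = xml.split("\n").toList

// Tag-based splitting yields one complete <book>...</book> element per record.
val books = RecordBoundaries.splitRecords(xml, "<book>", "</book>")
```

With line-based splitting, a single element is spread over several records, so no task sees a whole element. With tag-based splitting, each record can be parsed independently.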

And of course you will have to use SparkContext's hadoopRDD or hadoopFile methods, which let you pass an InputFormat class. If Java is your preferred language, similar alternatives exist.
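A rough sketch of how this could look in spark-shell follows. It has not been tested against any particular version. It assumes the hadoop-streaming jar is on the classpath, that `sc` is the SparkContext provided by the shell, and it borrows the `books.xml` file and `book` tag from the other answer purely as placeholders.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

// Tell Hadoop streaming to cut records at <book>...</book> instead of at line breaks.
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<book>")  // placeholder tag
jobConf.set("stream.recordreader.end", "</book>")
FileInputFormat.addInputPaths(jobConf, "books.xml") // placeholder path

// StreamInputFormat uses the old mapred API, so hadoopRDD is the matching method.
val records = sc.hadoopRDD(jobConf, classOf[StreamInputFormat], classOf[Text], classOf[Text])

// The XML fragment arrives in the key. Convert it to a String right away,
// because Hadoop reuses the same Text object between records.
val xmlStrings = records.map { case (key, _) => key.toString }
```

Each element of `xmlStrings` should then hold one complete element, which you can hand to any XML parser inside a `map`.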

Prashant Sharma answered Sep 19 '22 15:09
