
Reading HDF5 files [closed]

Is there a way to read HDF5 files using the Scala version of Spark?

It looks like it can be done in Python (via Pyspark), but I can't find anything for Scala.

John asked Feb 17 '15 16:02



1 Answer

There isn't a Hadoop InputFormat implementation for HDF5 because an HDF5 file cannot be split at arbitrary points:

Breaking the container into blocks is a bit like taking an axe and chopping it to pieces, severing blindly the content and the smart wiring in the process. The result is a mess, because there's no alignment or correlation between HDFS block boundaries and the internal HDF5 cargo layout or container support structure. Reference

The same site discusses converting HDF5 files to Avro files so that Hadoop/Spark can read them, but the PySpark example you alluded to is probably a simpler way to go. Either way, as the linked document mentions, a number of technical challenges need to be addressed to work efficiently and effectively with HDF5 files in Hadoop/Spark.
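For reference, the PySpark-style approach usually avoids the splitting problem by distributing row ranges rather than raw bytes: each task opens the file itself and reads only its own slice. The following is a rough sketch of that idea, not the code from the linked example. It assumes `h5py` is installed on every worker, that the file sits on a filesystem mounted at the same path on every worker (not raw HDFS blocks), and that the dataset is sliceable along its first axis. The path `/shared/data/example.h5`, the dataset name `measurements`, and the helper names `plan_chunks`, `read_chunk`, and `main` are all hypothetical.

```python
def plan_chunks(n_rows, chunk_rows):
    """Split the row range [0, n_rows) into (start, stop) pairs
    covering at most chunk_rows rows each."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [(start, min(start + chunk_rows, n_rows))
            for start in range(0, n_rows, chunk_rows)]


def read_chunk(path, dataset, start, stop):
    """Runs inside a Spark task: open the file and return one slice of rows."""
    import h5py  # imported lazily so the planning code works without h5py
    with h5py.File(path, "r") as f:
        # Slicing lets the HDF5 library locate the data itself,
        # so no file is ever cut at a byte boundary.
        return f[dataset][start:stop].tolist()


def main(path="/shared/data/example.h5",  # hypothetical shared path
         dataset="measurements",          # hypothetical dataset name
         chunk_rows=10000):
    import h5py
    from pyspark import SparkContext

    sc = SparkContext(appName="hdf5-by-chunk")

    # The driver only reads the dataset's length, not its contents.
    with h5py.File(path, "r") as f:
        n_rows = f[dataset].shape[0]

    chunks = plan_chunks(n_rows, chunk_rows)

    # One partition per row range; each task reads its own slice.
    rows = sc.parallelize(chunks, max(1, len(chunks))).flatMap(
        lambda c: read_chunk(path, dataset, c[0], c[1]))

    print(rows.count())
    sc.stop()
```

Calling `main()` from a script submitted with `spark-submit` would run the job. The same pattern should carry over to Scala by replacing `read_chunk` with a call into a JVM HDF5 reader, though which library to use for that is left open here.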

Timothy Perrigo answered Oct 01 '22 19:10
