Is there a way to read HDF5 files using the Scala version of Spark?
It looks like it can be done in Python (via Pyspark), but I can't find anything for Scala.
There isn't a Hadoop InputFormat implementation for HDF5 because an HDF5 file cannot be split at arbitrary points:
Breaking the container into blocks is a bit like taking an axe and chopping it to pieces, severing blindly the content and the smart wiring in the process. The result is a mess, because there's no alignment or correlation between HDFS block boundaries and the internal HDF5 cargo layout or container support structure. Reference
The same site discusses converting HDF5 files to Avro files so that Hadoop/Spark can read them, but the PySpark example you alluded to is probably a simpler way to go. Either way, as the linked document mentions, a number of technical challenges need to be addressed before HDF5 documents can be worked with efficiently and effectively in Hadoop/Spark.
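If you want to stay in Scala, one workaround for the splitting problem is to make the whole file the unit of parallelism: distribute a list of file paths and let each task open one complete file. The following is only a rough sketch under several assumptions. `Hdf5PerFile`, `DatasetReader`, and `readAll` are hypothetical names, the reader function is something you would have to supply on top of whichever HDF5 binding you pick (none is assumed here), and the paths must be reachable from every executor, for example on a shared filesystem.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Hdf5PerFile {
  // Hypothetical signature: (filePath, datasetName) => values of that dataset.
  // You provide the implementation using an HDF5 binding of your choice.
  type DatasetReader = (String, String) => Array[Double]

  // Distributes the file paths, not the file bytes, so no HDF5 file is ever
  // cut at a block boundary. Each path is opened in full by exactly one task.
  def readAll(
      spark: SparkSession,
      paths: Seq[String],
      dataset: String,
      reader: DatasetReader): RDD[Double] = {
    require(paths.nonEmpty, "at least one file path is required")
    spark.sparkContext
      .parallelize(paths, numSlices = paths.size) // one partition per file
      .flatMap(path => reader(path, dataset).iterator)
  }
}
```

You would call it as `Hdf5PerFile.readAll(spark, paths, "/my/dataset", myReader)`, where `myReader` is your own function. The trade-off is that parallelism is limited to the number of files, and each file's dataset has to fit in the memory of a single task, so this suits many moderately sized files better than one huge one.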