Does Hadoop streaming support the new columnar storage formats like ORC and Parquet, or are there frameworks on top of Hadoop that allow you to read such formats?
You can use HCatalog to read ORC files: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
It gives you an abstraction for reading ORC, Text, Sequence, and RC files; I am not sure whether Parquet is supported there. If that doesn't fit your needs, you can also use the ORC record readers from the Hive code base directly to read ORC files (OrcInputFormat, OrcOutputFormat).
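As a rough illustration of the second option, here is a minimal sketch that reads an ORC file row by row with the Hive reader API (org.apache.hadoop.hive.ql.io.orc), assuming Hive 0.13+ with hive-exec on the classpath; the class name OrcDump and the command-line argument are just placeholders for this example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Reader;
    import org.apache.hadoop.hive.ql.io.orc.RecordReader;

    public class OrcDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // args[0] is a path to an ORC file on HDFS or the local FS (placeholder)
            Reader reader = OrcFile.createReader(new Path(args[0]),
                                                 OrcFile.readerOptions(conf));
            RecordReader rows = reader.rows();   // iterates over the file's rows
            Object row = null;
            while (rows.hasNext()) {
                row = rows.next(row);            // returns an OrcStruct
                System.out.println(row);         // OrcStruct.toString() prints the fields
            }
            rows.close();
        }
    }

In a MapReduce job you would instead plug OrcInputFormat/OrcOutputFormat into the job configuration, but the row-reading loop above shows what those record readers hand you.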
Rather old news, but I struggled with this some time ago. I did not find any solution, so I wrote a set of input/output formats that convert Avro and Parquet files to and from plain text and JSON. It can be found at http://github.com/whale2/iow-hadoop-streaming. There's no ORC support, but Avro and Parquet are covered. Hope this helps.
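For context, a streaming job with such a library would look roughly like the command below. The exact input-format class name, jar names, and paths are assumptions for illustration only; check the repo's README for the actual class names it ships:

    hadoop jar hadoop-streaming.jar \
        -libjars iow-hadoop-streaming.jar \
        -inputformat net.iponweb.hadoop.streaming.parquet.ParquetAsTextInputFormat \
        -input /data/events_parquet \
        -output /data/events_out \
        -mapper my_mapper.py \
        -reducer my_reducer.py \
        -file my_mapper.py -file my_reducer.py

The input format converts each Parquet record to a line of text, so the mapper and reducer can stay plain stdin/stdout scripts as in any other streaming job.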