
Streaming frameworks on top of Hadoop that support ORC and Parquet file formats [closed]

Does Hadoop Streaming support the newer columnar storage formats such as ORC and Parquet, or are there frameworks on top of Hadoop that allow you to read such formats?

viper asked Nov 01 '22 02:11


2 Answers

You can use HCatalog to read ORC files: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

It provides an abstraction for reading ORC, text, SequenceFile, and RCFile data. I am not sure whether it supports Parquet. If that doesn't suit your needs, you can use the ORC record readers in the Hive code base (OrcInputFormat, OrcOutputFormat) to read ORC files directly.

user3614890 answered Nov 15 '22 08:11


This is rather old news, but I struggled with this some time ago. I did not find any solution, so I wrote a set of input/output formats that convert Avro and Parquet files to and from plain text and JSON. They can be found at http://github.com/whale2/iow-hadoop-streaming. There's no ORC support, but Avro and Parquet are supported. Hope this helps.
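To illustrate what the streaming side might look like once an input format has turned the columnar data into text, here is a minimal mapper sketch. It assumes each record reaches the mapper on stdin as a single-line JSON object; that framing, and the `user_id` field name, are assumptions made for illustration and are not taken from the project above.

```python
#!/usr/bin/env python3
"""Streaming mapper sketch: one JSON object per input line -> key<TAB>1."""
import json
import sys


def map_record(line, key_field="user_id"):
    """Return 'key<TAB>1' for one JSON record, or None if the line is unusable.

    key_field is a hypothetical column name used only for illustration.
    """
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip malformed lines rather than failing the whole task
    if not isinstance(record, dict) or key_field not in record:
        return None
    return f"{record[key_field]}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds records on stdin and collects key/value
    # pairs from stdout, one tab-separated pair per line.
    for line in stdin:
        out = map_record(line)
        if out is not None:
            stdout.write(out + "\n")


if __name__ == "__main__":
    main()
```

A script like this would be passed to Hadoop Streaming with `-mapper`, with the `-inputformat` option naming whichever input format class does the conversion. The exact class name depends on the library in use and is not given here.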

user3134802 answered Nov 15 '22 07:11