I have a requirement to split a 5 GB ORC file into 5 files of 1 GB each. ORC files are splittable. Does that mean a file can only be split stripe by stripe? I need to split the ORC file based on size instead. If that is possible, please share an example.
A common approach, given that your file could be 5 GB, 100 GB, 1 TB, 100 TB, and so on, is to let Hive rewrite the data. Create a Hive table that points to the directory containing this file, define a second table that points to a different directory, and then copy the data from the first table into the second with a Hive INSERT statement.
At the beginning of the script, make sure you have the following Hive flags:
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=1073741824;
set hive.merge.size.per.task=1073741824;
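With these settings, Hive runs an extra merge step whenever the average output file size of the job is below 1073741824 bytes (1 GB), and that step combines the output into files of roughly 1 GB each. Treat 1 GB as a target rather than an exact size; the resulting files will only be approximately that large.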
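The two-table approach described above could look roughly like the sketch below. The table names (orc_src, orc_out), the two columns, and both LOCATION paths are hypothetical placeholders that are not part of the original question; replace them with your real schema and directories.

```sql
-- Hypothetical names, columns, and paths: adjust all of them to your data.

-- Source table over the directory that holds the 5 GB ORC file.
CREATE EXTERNAL TABLE orc_src (
  id BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION '/data/orc_src';

-- Target table over a different, empty directory.
CREATE EXTERNAL TABLE orc_out (
  id BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION '/data/orc_out';

-- Merge flags from the answer: aim for output files of about 1 GB.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=1073741824;
SET hive.merge.size.per.task=1073741824;

-- Rewrite the data; the merge step runs after the insert job finishes.
INSERT OVERWRITE TABLE orc_out
SELECT * FROM orc_src;
```

After the insert finishes, list the target directory (for example with hdfs dfs -ls) to check the resulting file sizes. If your Hive runs on Tez rather than MapReduce, you may also need hive.merge.tezfiles=true for the merge step to run.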
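If you would rather do this from Java code (a MapReduce job) than from Hive, tune the input split size with these properties:
mapred.max.split.size
mapred.min.split.size
On Hadoop 2 and later these names are deprecated in favor of mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize.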