Currently, when I STORE into HDFS, it creates many part files.
Is there any way to store out to a single CSV file?
You can do this in a few ways:
To set the number of reducers for all Pig operations, you can use the default_parallel
property - but this means every single step will use a single reducer, decreasing throughput:
set default_parallel 1;
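For context, here is a minimal sketch of a complete script using this option; the relation names and HDFS paths are hypothetical:

-- Hypothetical script: every MapReduce job Pig launches uses one reducer
set default_parallel 1;
a = LOAD '/data/input' USING PigStorage(',') AS (grp:chararray, val:int);
b = GROUP a BY grp;
c = FOREACH b GENERATE group, SUM(a.val);
STORE c INTO '/data/output' USING PigStorage(',');

Note that the single reducer still writes into the output directory, so you get one part file (e.g. part-r-00000) under /data/output rather than a bare .csv file.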
If the operation executed immediately before the STORE is one of COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then you can append the PARALLEL 1
keyword to that command to make it run with a single reducer:
GROUP a BY grp PARALLEL 1;
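To sketch how this differs from default_parallel: only the operation carrying PARALLEL 1 is pinned to one reducer, while any other reduce-side operations keep their normal parallelism. Relation names and paths are again hypothetical:

-- Hypothetical script: only the GROUP feeding the STORE uses one reducer
a = LOAD '/data/input' USING PigStorage(',') AS (grp:chararray, val:int);
b = GROUP a BY grp PARALLEL 1;
c = FOREACH b GENERATE group, COUNT(a);
STORE c INTO '/data/output' USING PigStorage(',');

This is usually preferable to default_parallel 1, since it serializes only the final reduce stage that produces the stored output.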
See Pig Cookbook - Parallel Features for more information.