I am creating a single RDD out of two .csv files like this:
val combineRDD = sc.textFile("D://release//CSVFilesParellel//*.csv")
Then I want to define a custom partitioner on this RDD such that one partition contains exactly one file, so that each partition (i.e. one CSV file) is processed on one node for faster data processing.
Is it possible to write a custom partitioner based on the file size, the number of lines in a file, or the end-of-file character of a file?
How do I achieve this?
The structure of one file looks something like this:
00-00
Time(in secs) Measure1 Measure2 Measure3..... Measuren
0
0.25
0.50
0.75
1
...
3600
1. The first row of data contains the hours:mins. Each file contains data for 1 hour, or 3600 seconds.
2. The first column is a second divided into 4 parts of 250 ms each, with data recorded every 250 ms.
For every file I want to add the hours:mins to the seconds, so that my time looks like hours-mins-secs. The catch is that I don't want this process to happen sequentially.
I am using a foreach over the file names, then creating an RDD of the data in each file and adding the time as specified above (a sketch of this follows below).
But what I want is for every file to go to one node for processing and time calculation, as opposed to the data in one file being distributed across nodes.
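For reference, here is a minimal sketch of that per-file foreach approach. It assumes a comma delimiter, an "HH-mm" first row as in the sample above, and illustrative file names; PerFileTime and the output paths are hypothetical, not the original code.

import org.apache.spark.{SparkConf, SparkContext}

object PerFileTime {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PerFileTime"))

    // Illustrative file list; in practice list the CSVFilesParellel directory.
    val files = Seq(
      "D://release//CSVFilesParellel//file1.csv",
      "D://release//CSVFilesParellel//file2.csv"
    )

    files.foreach { path =>
      // minPartitions = 1 asks Spark for a single partition; a file smaller
      // than one block then stays on one node.
      val lines   = sc.textFile(path, minPartitions = 1)
      val indexed = lines.zipWithIndex()

      // First row holds "HH-mm" (e.g. "00-00"); second row holds column names.
      val Array(hh, mm) = indexed.filter(_._2 == 0L).first()._1.split("-")

      val withTime = indexed
        .filter(_._2 >= 2L)                    // skip the two header rows
        .map { case (row, _) =>
          val cols = row.split(",")            // assumed comma-delimited
          val secs = cols.head                 // 0, 0.25, 0.50, ...
          (s"$hh-$mm-$secs" +: cols.tail).mkString(",")
        }

      withTime.saveAsTextFile(path + "_withTime")
    }
  }
}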
Thank you.
Regards,
Vinay Joglekar
Let's go back to basics.
Gzip, being non-splittable, will force one file into one partition, but this by itself will not increase throughput in any way.
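As a quick illustration (assuming the files were gzipped, which is not the case in the question), reading .gz files with textFile yields exactly one partition per file, because each archive cannot be split:

// Gzip is non-splittable, so each .csv.gz becomes exactly one partition.
val gzRDD = sc.textFile("D://release//CSVFilesParellel//*.csv.gz")
println(gzRDD.getNumPartitions)   // equals the number of .gz files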
To write a custom input format, extend FileInputFormat and provide your split logic and RecordReader logic.
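A minimal sketch of such a format, using the new (mapreduce) Hadoop API; WholeFileInputFormat is a hypothetical name, and the split logic simply disables splitting while reusing the standard line reader:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// Split logic: never split a file, so one file maps to exactly one split,
// which Spark turns into exactly one partition.
class WholeFileInputFormat extends FileInputFormat[LongWritable, Text] {

  override def isSplitable(context: JobContext, file: Path): Boolean = false

  // RecordReader logic: reuse the standard line-by-line reader.
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new LineRecordReader()
}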
To use a custom input format in Spark, please follow:
http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
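For example (a sketch that assumes the WholeFileInputFormat above rather than the CombineParquetFileInputFormat from the link), a custom new-API format is wired into Spark with newAPIHadoopFile:

import org.apache.hadoop.io.{LongWritable, Text}

// Each matched .csv file becomes one InputSplit and hence one partition.
val perFileRDD = sc.newAPIHadoopFile(
  "D://release//CSVFilesParellel//*.csv",
  classOf[WholeFileInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, line) => line.toString }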