 

Partition an RDD in Apache Spark such that one partition consists of one file

I am creating a single RDD from two .csv files like this:

val combineRDD = sc.textFile("D://release//CSVFilesParellel//*.csv")

Then I want to define a custom partitioning on this RDD such that one partition contains exactly one file, so that each partition (i.e. one .csv file) is processed on one node for faster data processing.

Is it possible to write a custom partitioner based on the file size, the number of lines in a file, or the end-of-file character of a file?

How do I achieve this?

The structure of one file looks something like this:

00-00
Time(in secs) Measure1 Measure2 Measure3 ... Measuren
0
0.25
0.50
0.75
1
...
3600


1. The first row of data contains the hours:mins. Each file contains data for 1 hour, i.e. 3600 secs.

2. The first column is a second divided into 4 parts of 250 ms each, with data recorded every 250 ms.

  1. For every file I want to prepend the hours:mins to the seconds so that my time looks like hours-mins-secs (e.g. the first row 00-00 combined with the seconds value 0.75 gives 00-00-0.75). But the catch is that I don't want this process to happen sequentially.

  2. I am using a for-each over the file names, then creating an RDD of the data in each file and adding the time as specified above (a sketch of this approach follows below this list).

  3. But what I want is that every file should go to one node for processing and calculating the time, as opposed to the data of a single file getting distributed across nodes for calculating the time.
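A minimal sketch of this sequential per-file approach (assuming comma-separated columns, the file layout shown above, and a local directory listing; names like csvFiles and timed are just illustrative):

import java.io.File

// Driver-side loop over the files: one RDD per file, processed one after another.
val csvFiles = new File("D://release//CSVFilesParellel//").listFiles
  .filter(_.getName.endsWith(".csv"))

csvFiles.foreach { file =>
  val lines = sc.textFile(file.getPath)
  val hourMin = lines.first() // first row of the file, e.g. "00-00"
  val timed = lines
    .zipWithIndex()
    .filter { case (_, idx) => idx >= 2 } // skip the "00-00" row and the header row
    .map { case (line, _) =>
      val cols = line.split(",")
      // prepend hours-mins to the seconds column: "00-00-0.75"
      (s"$hourMin-${cols(0)}" +: cols.tail).mkString(",")
    }
  // ... further per-file processing of `timed` ...
}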

Thank you.

Regards,

Vinay Joglekar


1 Answer

Let's go back to basics.

  1. The philosophy of Big Data is to move the process to the data, not the data to the process. This increases parallelism and hence I/O throughput.
  2. A partitioner that puts one whole file into one partition will decrease parallelism, not increase it.
  3. The simplest way to achieve one file per partition is to use TextInputFormat and have your input files compressed with gzip or LZO (no LZO indexing should be done).
  4. Gzip, being non-splittable, will force each file into a single partition, but this will in no way help increase throughput of any kind.
  5. To write a custom input format, extend FileInputFormat and provide your split logic and RecordReader logic (see the sketch below).
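Point 5, sketched minimally: the simplest non-splitting variant reuses TextInputFormat's line-based RecordReader and only overrides isSplitable, so every file maps to exactly one input split and hence one Spark partition (the class name WholeFileTextInputFormat is illustrative, not a library class):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Keep TextInputFormat's record reading, but never split a file:
// each file then becomes exactly one input split / one Spark partition.
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}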

To use a custom input format in Spark, please follow:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
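As a rough usage sketch (building on the hypothetical WholeFileTextInputFormat above), the RDD can then be created with newAPIHadoopFile:

import org.apache.hadoop.io.{LongWritable, Text}

// One partition per .csv file; records are still individual lines.
val perFileRDD = sc.newAPIHadoopFile(
  "D://release//CSVFilesParellel//*.csv",
  classOf[WholeFileTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString }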

KrazyGautam