Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark: difference when read in .gz and .bz2

I normally read and write files in Spark using .gz, which the number of files should be the same as the number of RDD partitions. I.e. one giant .gz file will read in to a single partition. However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark support automatic split one .bz2 to multiple partitions?

Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file. Thanks!

like image 338
Edamame Avatar asked May 25 '16 18:05

Edamame


2 Answers

    However, if I read in one single .bz2, would I still get one single giant partition?   
Or will Spark support automatic split one .bz2 to multiple partitions?

If you specify n partitions to read a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is set to sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).


. one giant .gz file will read in to a single partition.

Please note that you can always do a

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

to get the desired number of partitions after the file has been read.


Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file.

That would be yourRDD.partitions.size for the scala api or yourRDD.getNumPartitions() for the python api.

like image 109
axiom Avatar answered Oct 28 '22 18:10

axiom


I don't know why my test-program run on one executor, after some test I think I get it, like that:

by pySpark

// Load a DataFrame of users. Each line in the file is a JSON 

// document, representing one row.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val user = sqlContext.read.json("users.json.bz2")
like image 23
史荣琦 Avatar answered Oct 28 '22 16:10

史荣琦