Spark: difference when read in .gz and .bz2

Question

I normally read and write files in Spark using .gz, which the number of files should be the same as the number of RDD partitions. I.e. one giant .gz file will read in to a single partition. However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark support automatic split one .bz2 to multiple partitions?

Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file. Thanks!

axiom · Accepted Answer

    However, if I read in one single .bz2, would I still get one single giant partition?   
Or will Spark support automatic split one .bz2 to multiple partitions?

If you specify n partitions to read a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is set to sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).

. one giant .gz file will read in to a single partition.

Please note that you can always do a

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

to get the desired number of partitions after the file has been read.

Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file.

That would be yourRDD.partitions.size for the scala api or yourRDD.getNumPartitions() for the python api.

史荣琦 · Answer

I don't know why my test-program run on one executor, after some test I think I get it, like that:

by pySpark

// Load a DataFrame of users. Each line in the file is a JSON 

// document, representing one row.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val user = sqlContext.read.json("users.json.bz2")

Spark: difference when read in .gz and .bz2

Tags:

gzip

apache-spark

rdd

bz2

Edamame

2 Answers

axiom

史荣琦

Recent Activity

Donate For Us

Spark: difference when read in .gz and .bz2

Tags:

gzip

apache-spark

rdd

bz2

Edamame

2 Answers

axiom

史荣琦

Related questions

Recent Activity

Donate For Us