get size of parquet file in HDFS for repartition with Spark in Scala

I have many Parquet file directories on HDFS, each containing a few thousand small Parquet files (most are under 100 KB). They slow down my Spark job, so I want to combine them.

With the following code I can repartition a local Parquet file into a smaller number of parts:

val pqFile = sqlContext.read.parquet("file:/home/hadoop/data/file.parquet")
pqFile.coalesce(4).write.save("file:/home/hadoop/data/fileSmaller.parquet")

But I don't know how to get the size of an HDFS directory programmatically from Scala, so I can't work out the number of partitions to pass to coalesce for the real data set.

How can I do this? Or is there a convenient way within Spark to configure the writer to write fixed-size Parquet partitions?

asked Nov 29 '15 by Bamqf

1 Answer

You could try

pqFile.inputFiles.size

which returns "a best-effort snapshot of the files that compose this DataFrame" according to the documentation.
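
Counting the input files already gives you a rough handle on the partition count. A hedged sketch (the figure of ~200 small input files per output part and the output path are illustrative assumptions, not recommendations):

// Illustrative assumption: roughly 200 of the small input files per output part
val filesPerPart = 200
val numParts = math.max(1, pqFile.inputFiles.length / filesPerPart)
pqFile.coalesce(numParts).write.save("file:/home/hadoop/data/fileSmaller.parquet")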

As an alternative, directly on the HDFS level:

// Get a handle on the default filesystem from the Hadoop configuration
val hdfs: org.apache.hadoop.fs.FileSystem =
  org.apache.hadoop.fs.FileSystem.get(
    new org.apache.hadoop.conf.Configuration())

val hadoopPath = new org.apache.hadoop.fs.Path("hdfs://localhost:9000/tmp")
val recursive = false
val ri = hdfs.listFiles(hadoopPath, recursive)

// Wrap Hadoop's RemoteIterator in a Scala Iterator
val it = new Iterator[org.apache.hadoop.fs.LocatedFileStatus]() {
  override def hasNext = ri.hasNext
  override def next() = ri.next()
}

// Materialize the iterator
val files = it.toList
println(files.size)
println(files.map(_.getLen).sum)

This way you get the file sizes as well.
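
To turn the summed byte count into a coalesce argument, one option is to divide by a target partition size. A minimal sketch, assuming a target of roughly 128 MB per output partition (the target size and the paths are illustrative, not prescribed):

// Illustrative assumption: aim for roughly 128 MB per output partition
val targetPartitionBytes = 128L * 1024 * 1024
val totalBytes = files.map(_.getLen).sum
// Round up so the last partition is not oversized
val numParts = math.max(1, ((totalBytes + targetPartitionBytes - 1) / targetPartitionBytes).toInt)

val pqDir = sqlContext.read.parquet("hdfs://localhost:9000/tmp")
pqDir.coalesce(numParts).write.save("hdfs://localhost:9000/tmpSmaller.parquet")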

answered Sep 22 '22 by Beryllium