Efficient way to read specific columns from a Parquet file in Spark

Tags:

apache-spark

parquet

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a type-safe Dataset with case classes to predefine my schema, but I'm not sure about it.

asked Jan 24 '18 12:01

horatio1701d

4 Answers

val df = spark.read.parquet("fs://path/file.parquet").select(...)

This reads only the selected columns. Parquet is a columnar storage format and is meant for exactly this type of use case. Try running df.explain and Spark will print the execution plan, which shows that only the selected columns are read. explain also shows which filters are pushed down to the physical plan if you add a where condition. Finally, use the following code to convert the DataFrame (a Dataset of rows) to a Dataset of your case class.

case class MyData...
val ds = df.as[MyData]
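
A fuller sketch of the steps in this answer, under some assumptions: the column names col1 and col2, the field types in MyData, the ReadSubset object, and the local master are all hypothetical, and the example writes its own small Parquet file to a temporary directory so that it can run on its own.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class; field names and types must match the selected columns.
case class MyData(col1: Long, col2: Float)

object ReadSubset {
  // Reads only col1 and col2 and returns them as a typed Dataset.
  def readSubset(spark: SparkSession, path: String): Dataset[MyData] = {
    import spark.implicits._ // provides the Encoder for MyData
    val df = spark.read.parquet(path).select("col1", "col2")
    df.explain() // the scan node's ReadSchema should list only col1 and col2
    df.as[MyData]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-parquet-subset")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Write a small three-column file so the sketch is self-contained.
    val path = Files.createTempDirectory("parquet-demo").resolve("data.parquet").toString
    Seq((1L, 1.5f, "a"), (2L, 2.5f, "b")).toDF("col1", "col2", "col3").write.parquet(path)

    readSubset(spark, path).show()
    spark.stop()
  }
}
```

The third column, col3, is written to the file but never requested, so it should not appear in the scan's read schema.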

answered Oct 04 '22 11:10

Oli

Spark supports pushdown with Parquet, so

load(<parquet>).select(...col1, col2)

is fine.

I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.

This could be an issue, as it looks like some optimizations don't work in this context; see "Spark 2.0 Dataset vs DataFrame".
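
One way to check this concern on your own data is to compare the plans of an untyped and a typed version of the same query. The following is only a sketch: MyData, its fields, and the TypedVsUntyped object are hypothetical, and the exact plan output depends on the Spark version. The usual expectation is that a lambda is opaque to the optimizer, so its predicate is not pushed down to the Parquet scan.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical case class for illustration.
case class MyData(col1: Long, col2: Float)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Write a small three-column file so the sketch is self-contained.
    val path = Files.createTempDirectory("parquet-demo").resolve("data.parquet").toString
    Seq((1L, 1.5f, "a"), (2L, 2.5f, "b")).toDF("col1", "col2", "col3").write.parquet(path)

    val ds = spark.read.parquet(path).as[MyData]

    // Column expressions: the optimizer sees the predicate and the columns used,
    // so the filter can be pushed to the Parquet scan.
    ds.filter($"col1" > 1L).select($"col2").explain()

    // Lambdas: rows are deserialized into MyData before the function runs,
    // so the predicate is not visible to the optimizer.
    ds.filter(_.col1 > 1L).map(_.col2).explain()

    spark.stop()
  }
}
```

Comparing the PushedFilters and ReadSchema entries of the two plans shows whether the typed API costs anything for a given query.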

answered Oct 04 '22 09:10

Alper t. Turker

At least in some cases, loading a DataFrame with all columns and then selecting a subset won't work. For example, the following fails if the Parquet file contains at least one field whose type is not supported by Spark:

spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")

One solution is to pass load a schema that contains only the requested columns:

spark.read.format("parquet").load("<path_to_file>",
                                   schema="col1 bigint, col2 float")

This way you can load a subset of Spark-supported Parquet columns even if loading the full file is not possible. I'm using PySpark here, but I would expect the Scala version to have something similar.
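
For the Scala side, a minimal sketch of the same idea uses DataFrameReader.schema, which accepts a DDL-formatted string in Spark 2.3 and later (or a StructType). The ReadWithSchema object, the column names, and the temporary file are assumptions made so the example can run on its own; it does not reproduce the unsupported-type case.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadWithSchema {
  // Supplies the schema up front, so only the listed columns are requested.
  def readWithSchema(spark: SparkSession, path: String): DataFrame =
    spark.read.schema("col1 bigint, col2 float").parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-parquet-with-schema")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Write a small three-column file so the sketch is self-contained.
    val path = Files.createTempDirectory("parquet-demo").resolve("data.parquet").toString
    Seq((1L, 1.5f, "a"), (2L, 2.5f, "b")).toDF("col1", "col2", "col3").write.parquet(path)

    val df = readWithSchema(spark, path)
    df.printSchema() // lists col1 and col2 only
    df.show()
    spark.stop()
  }
}
```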

answered Oct 04 '22 10:10

Alexander Pivovarov

Parquet is a columnar file format. It is designed for exactly this kind of use case.

val df = spark.read.parquet("<PATH_TO_FILE>").select(...)

should do the job for you.

answered Oct 04 '22 09:10

moriarty007
