I'm pretty new to Spark (2 days) and I'm pondering the best way to partition parquet files.
My rough plan ATM is:
It's been ludicrously easy (kudos to the Spark devs) to get a simple version working, except for partitioning the way I'd like to. This is in Python, BTW:
input = sqlContext.read.format('com.databricks.spark.csv').load(source, schema=myschema)
input.write.partitionBy('type').format("parquet").save(dest, mode="append")
Is the best approach to map the RDD, adding new columns for year, month, day and hour, and then use partitionBy on those? And then for any query we have to manually add year/month etc. to the filters? Given how elegant I've found Spark to be so far, this seems a little odd.
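To illustrate what I mean, roughly something like this (untested; assumes the timestamp column is called inserted_at and a Spark version that has the date helpers in pyspark.sql.functions):
from pyspark.sql.functions import year, month, dayofmonth, hour

# add one derived column per partition level
with_parts = (input
    .withColumn('year', year(input.inserted_at))
    .withColumn('month', month(input.inserted_at))
    .withColumn('day', dayofmonth(input.inserted_at))
    .withColumn('hour', hour(input.inserted_at)))
with_parts.write.partitionBy('year', 'month', 'day', 'hour').format("parquet").save(dest, mode="append")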
Thanks
Understanding Spark partitioning:
1. By default, Spark/PySpark creates partitions equal to the number of CPU cores in the machine.
2. The data of each partition resides on a single machine.
3. Spark/PySpark creates one task per partition.
4. Spark shuffle operations move data from one partition to other partitions.
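A quick sketch of inspecting and changing that (assuming the input DataFrame from the question above):
# how many partitions the DataFrame currently has
print(input.rdd.getNumPartitions())
# full shuffle into 16 roughly equal partitions
repartitioned = input.repartition(16)
# shrink to 4 partitions without a full shuffle
coalesced = input.coalesce(4)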
Parquet also supports partitioning of data based on the values of one or more columns, which affects query performance. The observations here come from Spark on YARN, but they hold true for other HDFS querying tools like Hive and Drill.
Apache Spark supports two types of partitioning, "hash partitioning" and "range partitioning". How the keys in your data are distributed or sequenced, as well as the action you want to perform on the data, can help you select the appropriate technique; many factors affect the choice of partitioning.
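A rough sketch of the two with a pair RDD (the sample data is made up; assumes an existing SparkContext called sc, as in the pyspark shell): partitionBy hash-partitions by key, while sortByKey range-partitions behind the scenes:
pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("d", 4)])

hashed = pairs.partitionBy(4)               # hash partitioning: hash(key) % 4 chooses the partition
ranged = pairs.sortByKey(numPartitions=4)   # range partitioning: keys are split into sorted ranges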
We can see that the parquet format needed about 62% less disk space for the smaller dataset and 37% less for the larger dataset (the larger dataset has a UUID column, which can't be encoded and compressed as effectively as a column with possible repetitions). We can partition the data on any one or more of the columns.
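As a sketch (the path and column names are purely illustrative, and df is a hypothetical DataFrame that already has year and month columns), partitioning on more than one column simply nests the directories:
df.write.partitionBy('year', 'month').format("parquet").save('/data/events')
# resulting layout:
#   /data/events/year=2015/month=7/part-....parquet
#   /data/events/year=2015/month=8/part-....parquet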
I've found a few ways to do this now; I haven't run performance tests over them yet, so caveat emptor:
First we need to create a derived DataFrame (three ways shown below) and then write it out.
1) SQL queries (inline functions)
from pyspark.sql.types import IntegerType

# register a UDF that pulls the day of the month out of a timestamp
sqlContext.registerFunction("day", lambda ts: ts.day, IntegerType())
input.registerTempTable("input")
input_ts = sqlContext.sql(
    "select day(inserted_at) AS inserted_at_day, * from input")
2) SQL queries (non-inline) - very similar
def day(ts):
    return ts.day

sqlContext.registerFunction("day", day, IntegerType())
... rest as before
3) withColumn
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

day = udf(lambda ts: ts.day, IntegerType())
input_ts = input.withColumn('inserted_at_day', day(input.inserted_at))
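Depending on your Spark version there may also be a built-in that avoids the Python UDF entirely, e.g. dayofmonth in pyspark.sql.functions:
from pyspark.sql.functions import dayofmonth

input_ts = input.withColumn('inserted_at_day', dayofmonth(input.inserted_at))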
To write it out, just:
input_ts.write.partitionBy(['inserted_at_day']).format("parquet").save(dest, mode="append")
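And to answer my own worry above about queries: when reading the partitioned output back, a filter on the partition column means Spark only scans the matching sub-directories (a quick sketch, reusing dest from above):
# the partition column (inserted_at_day) shows up as a normal column on read
readback = sqlContext.read.format("parquet").load(dest)
# filtering on it prunes the scan down to the matching inserted_at_day=... directories
day_one = readback.filter(readback.inserted_at_day == 1)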