Spark: save DataFrame partitioned by "virtual" column

I'm using PySpark for a classic ETL job (load a dataset, process it, save it) and want to save my DataFrame as files/directories partitioned by a "virtual" column. By "virtual" I mean this: I have a Timestamp column, a string containing an ISO 8601 encoded date, and I'd like to partition by Year / Month / Day, but the DataFrame has no Year, Month, or Day column. I can derive these columns from Timestamp, but I don't want any of them serialized in the resulting items.

The file structure resulting from saving the DataFrame to disk should look like:

/ 
    year=2016/
        month=01/
            day=01/
                part-****.gz

Is there a way to do this with Spark / PySpark?

649

asked Feb 16 '16 16:02

arnaud briche

1 Answer

Columns that are used for partitioning are not included in the serialized data itself. For example, if you create a DataFrame like this:

df = sc.parallelize([
    (1, "foo", 2.0, "2016-02-16"),
    (2, "bar", 3.0, "2016-02-16")
]).toDF(["id", "x", "y", "date"])

and write it as follows:

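import tempfile
from pyspark.sql.functions import col, dayofmonth, month, year

# Fresh path that does not exist yet; the writer creates the directory itself.
outdir = tempfile.mktemp()

# Parse the string column as a date, then derive year / month / day from it.
dt = col("date").cast("date")
fname = [(year, "year"), (month, "month"), (dayofmonth, "day")]
exprs = [col("*")] + [f(dt).alias(name) for f, name in fname]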

(df
    .select(*exprs)
    .write
    .partitionBy(*(name for _, name in fname))
    .format("json")
    .save(outdir))

individual files won't contain partition columns:

import os

(sqlContext.read
    .json(os.path.join(outdir, "year=2016/month=2/day=16/"))
    .printSchema())

## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)
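
Note that the directory read above is month=2, not month=01-style as in the question: the derived values are integers, so they are not zero-padded. As a rough illustration of the resulting layout, here is a minimal pure-Python sketch (no Spark involved) applied to ISO 8601 strings like the question's Timestamp column. The helpers partition_dir and group_by_partition and the sample records are hypothetical, made up for this example; they only model the layout and are not how Spark implements partitionBy.

```python
from collections import defaultdict
from datetime import datetime


def partition_dir(iso_timestamp):
    """Build the Hive-style directory for an ISO 8601 date or datetime string."""
    ts = datetime.fromisoformat(iso_timestamp)
    # Values are plain integers, so February is "month=2", not "month=02".
    return f"year={ts.year}/month={ts.month}/day={ts.day}"


def group_by_partition(records, ts_field="Timestamp"):
    """Map each partition directory to the records that would land under it."""
    groups = defaultdict(list)
    for record in records:
        # The derived year/month/day live only in the directory name;
        # the record itself is stored unchanged.
        groups[partition_dir(record[ts_field])].append(record)
    return dict(groups)


# Hypothetical sample data, assumed for illustration only.
records = [
    {"id": 1, "Timestamp": "2016-02-16T16:02:00"},
    {"id": 2, "Timestamp": "2016-02-16T23:59:59"},
    {"id": 3, "Timestamp": "2016-01-01T00:00:00"},
]
layout = group_by_partition(records)
for directory in sorted(layout):
    print(directory, [r["id"] for r in layout[directory]])
```

The records keep only their original fields (id and Timestamp); the year, month and day values appear solely in the directory names, which mirrors what the schema above shows for a single leaf directory.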

Partitioning data is stored only in the directory structure and is not duplicated in the serialized files. It is attached only when you read a complete or partial directory tree:

sqlContext.read.json(outdir).printSchema()

## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)
##  |-- year: integer (nullable = true)
##  |-- month: integer (nullable = true)
##  |-- day: integer (nullable = true)

sqlContext.read.json(os.path.join(outdir, "year=2016/month=2/")).printSchema()

## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)
##  |-- day: integer (nullable = true)
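
The three schemas above follow one rule: only the key=value directories below the path you read from become columns. A simplified pure-Python model of that rule is sketched here. The function discovered_columns and the /out paths are hypothetical names for illustration, and the integer conversion is a simplification; this is not Spark's actual partition discovery code.

```python
from pathlib import PurePosixPath


def discovered_columns(base_path, file_path):
    """Return the key=value directory segments between base_path and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    columns = {}
    for segment in relative.parent.parts:
        if "=" in segment:
            key, value = segment.split("=", 1)
            # Simplification: treat all-digit values as integers.
            columns[key] = int(value) if value.isdigit() else value
    return columns


# Hypothetical data file path, assumed for illustration only.
data_file = "/out/year=2016/month=2/day=16/part-00000"

print(discovered_columns("/out", data_file))
print(discovered_columns("/out/year=2016/month=2", data_file))
print(discovered_columns("/out/year=2016/month=2/day=16", data_file))
```

In Spark itself, the starting point for discovery defaults to the path passed to the reader. It can be overridden with the basePath read option, which lets you load a single subdirectory while still getting the parent partition columns.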

196

answered Oct 17 '22 18:10

zero323

Tags: dataframe, apache-spark, apache-spark-sql, pyspark, partitioning
