
Why is partition key column missing from DataFrame

I have a job which loads a DataFrame and then saves the data in Parquet format using the DataFrameWriter partitionBy method. Then I publish the paths that were created so a subsequent job can use the output. The paths in the output look like this:

/ptest/_SUCCESS
/ptest/id=0
/ptest/id=0/part-00000-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=0/part-00001-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=0/part-00002-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1
/ptest/id=1/part-00003-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1/part-00004-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1/part-00005-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=3
/ptest/id=3/part-00006-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=3/part-00007-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet

When I receive new data it is appended to the dataset. The paths are published so jobs which depend on the data can just process the new data.
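
As a rough sketch (new_df here is a hypothetical stand-in for a DataFrame holding the newly arrived rows), the append step is just a partitioned write in append mode:

>>> # new_df: hypothetical DataFrame of newly arrived rows
>>> new_df.write.mode("append").partitionBy("id").format("parquet").save("hdfs://localhost:9000/ptest")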

Here's a simplified example of the code:

>>> rdd = sc.parallelize([(0,1,"A"), (0,1,"B"), (0,2,"C"), (1,2,"D"), (1,10,"E"), (1,20,"F"), (3,18,"G"), (3,18,"H"), (3,18,"I")])
>>> df = sqlContext.createDataFrame(rdd, ["id", "score","letter"])
>>> df.show()
+---+-----+------+
| id|score|letter|
+---+-----+------+
|  0|    1|     A|
|  0|    1|     B|
|  0|    2|     C|
|  1|    2|     D|
|  1|   10|     E|
|  1|   20|     F|
|  3|   18|     G|
|  3|   18|     H|
|  3|   18|     I|
+---+-----+------+
>>> df.write.partitionBy("id").format("parquet").save("hdfs://localhost:9000/ptest")

The problem is when another job tries to read the file using the published paths:

>>> df2 = spark.read.format("parquet").load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+-----+------+
|score|letter|
+-----+------+
|    1|     A|
|    1|     B|
|    2|     C|
+-----+------+

As you can see, the partition key is missing from the loaded dataset. If I were to publish a schema that jobs could use, I could load the file with that schema. The file loads and the partition key column exists, but its values are null:

>>> df2 = spark.read.format("parquet").schema(df.schema).load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+----+-----+------+
|  id|score|letter|
+----+-----+------+
|null|    1|     A|
|null|    1|     B|
|null|    2|     C|
+----+-----+------+

Is there a way to make sure the partition keys are stored within the parquet data? I don't want to require other processes to parse the paths to get the keys.

asked Apr 03 '17 by Mark J Miller

People also ask

What is a partition in a DataFrame?

partitionBy() is a method of the DataFrameWriter class which is used to partition the output based on one or multiple column values while writing a DataFrame to a disk/file system. When you write a Spark DataFrame to disk by calling partitionBy(), PySpark splits the records based on the partition column and stores each partition's data in a sub-directory.

What is a partition in a spark DataFrame?

Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark; RDDs in Apache Spark are collections of partitions.
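
For instance, a minimal way to inspect this from PySpark, using the standard getNumPartitions() and repartition() APIs (the count of 8 below is arbitrary, chosen only for illustration):

>>> df.rdd.getNumPartitions()                  # number of partitions currently backing df
>>> df.repartition(8).rdd.getNumPartitions()   # 8 after an explicit repartition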

How to check the partitioning key of a DataFrame?

There is no partitioning key once the data has been loaded, but you can check queryExecution for the Partitioner. If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.

What are the values in the table partitioning key columns?

The values in the table partitioning key columns are used to determine in which data partition each table row belongs. To define the table partitioning key on a table use the CREATE TABLE statement with the PARTITION BY clause.

What is the default number of partitions in spark dataframe?

These functions, when called on a DataFrame, shuffle data across machines (or, more commonly, across executors), which ultimately repartitions the data into 200 partitions by default. This default of 200 can be controlled using the spark.sql.shuffle.partitions configuration.
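
For example, the setting can be changed at runtime through the session configuration (the value 50 is arbitrary, for illustration only):

>>> spark.conf.set("spark.sql.shuffle.partitions", "50")   # default is 200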

Can generated columns be used as Table partitions?

Generated columns can be used as table partitioning keys. For example, a table could be created with twelve data partitions, one for each month: all rows for January of any year would be placed in the first data partition, rows for February in the second, and so on.


1 Answer

In a case like this you should provide the basePath option:

(spark.read
    .format("parquet")
    .option("basePath", "hdfs://localhost:9000/ptest/")
    .load("hdfs://localhost:9000/ptest/id=0/"))

which points to the root directory of your data.

With basePath set, the DataFrameReader will be aware of the partitioning and adjust the schema accordingly.
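
With the question's example data, the full read would look roughly like this (output sketched under the assumption that partition discovery infers id as an integer; note that Spark appends discovered partition columns at the end of the schema):

df2 = (spark.read
    .format("parquet")
    .option("basePath", "hdfs://localhost:9000/ptest/")
    .load("hdfs://localhost:9000/ptest/id=0/"))
df2.show()
+-----+------+---+
|score|letter| id|
+-----+------+---+
|    1|     A|  0|
|    1|     B|  0|
|    2|     C|  0|
+-----+------+---+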

answered Oct 18 '22 by zero323