Why is the partition key column missing from the DataFrame?

I have a job which loads a DataFrame object and then saves the data in Parquet format using the partitionBy method of the DataFrame writer. Then I publish the paths created so a subsequent job can use the output. The paths in the output would look like this:

hdfs://localhost:9000/ptest/id=0/
hdfs://localhost:9000/ptest/id=1/
hdfs://localhost:9000/ptest/id=3/

When I receive new data it is appended to the dataset. The paths are published so jobs which depend on the data can just process the new data.

Here's a simplified example of the code:

>>> rdd = sc.parallelize([(0,1,"A"), (0,1,"B"), (0,2,"C"), (1,2,"D"), (1,10,"E"), (1,20,"F"), (3,18,"G"), (3,18,"H"), (3,18,"I")])
>>> df = sqlContext.createDataFrame(rdd, ["id", "score","letter"])
>>> df.show()
+---+-----+------+
| id|score|letter|
+---+-----+------+
|  0|    1|     A|
|  0|    1|     B|
|  0|    2|     C|
|  1|    2|     D|
|  1|   10|     E|
|  1|   20|     F|
|  3|   18|     G|
|  3|   18|     H|
|  3|   18|     I|
+---+-----+------+
>>> df.write.partitionBy("id").format("parquet").save("hdfs://localhost:9000/ptest")

The problem is when another job tries to read the file using the published paths:

>>> df2 = spark.read.format("parquet").load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+-----+------+
|score|letter|
+-----+------+
|    1|     A|
|    1|     B|
|    2|     C|
+-----+------+

As you can see, the partition key is missing from the loaded dataset. If I publish a schema for downstream jobs and they load the file with it, the file loads and the partition key column exists, but its values are null:

>>> df2 = spark.read.format("parquet").schema(df.schema).load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+----+-----+------+
|  id|score|letter|
+----+-----+------+
|null|    1|     A|
|null|    1|     B|
|null|    2|     C|
+----+-----+------+

Is there a way to make sure the partition keys are stored within the parquet data? I don't want to require other processes to parse the paths to get the keys.

In a case like this you should provide the basePath option on the reader, before calling load:

    .option("basePath", "hdfs://localhost:9000/ptest/")

which points to the root directory of your data.

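By default, partition discovery starts at the path you pass to load, so the id=0 directory in that path is not treated as a partition column. With basePath set, DataFrameReader will be aware of the partitioning and adjust the schema accordingly.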
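
As a rough mental model of why basePath matters, here is a minimal pure-Python sketch. It is not Spark's implementation, and the function name `discover_partitions` is hypothetical. It assumes only that directory names of the form key=value below the starting path become partition columns. Spark additionally infers column types, which this sketch does not attempt.

```python
def discover_partitions(base_path, data_path):
    """Simplified model: collect key=value directory names found below base_path.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    base = base_path.rstrip("/")
    path = data_path.rstrip("/")
    if path != base and not path.startswith(base + "/"):
        raise ValueError("data_path must be inside base_path")

    # Only the part of the path below the base is examined.
    relative = path[len(base):].strip("/")
    partitions = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


root = "hdfs://localhost:9000/ptest/"
leaf = "hdfs://localhost:9000/ptest/id=0/"

# Without basePath, the loaded path is itself the starting point,
# so the "id=0" directory is never examined.
without_base = discover_partitions(leaf, leaf)

# With basePath pointing at the dataset root, "id=0" lies below the start.
with_base = discover_partitions(root, leaf)

print(without_base)  # {}
print(with_base)     # {'id': '0'}
```

In the reader itself this corresponds to something like `spark.read.format("parquet").option("basePath", "hdfs://localhost:9000/ptest/").load("hdfs://localhost:9000/ptest/id=0/")`, which should return the rows for id=0 with the id column populated.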

