When I save a Spark DataFrame as a Parquet file and then read it back, the rows of the resulting DataFrame are not in the same order as the original, as shown in the session below. Is this a "feature" of DataFrames or of Parquet files? What would be the best way to save a DataFrame in a row-order-preserving manner?
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
0 1
0 0.191519 0.622109
1 0.437728 0.785359
2 0.779976 0.272593
3 0.276464 0.801872
4 0.958139 0.875933
5 0.357817 0.500995
6 0.683463 0.712702
7 0.370251 0.561196
8 0.503083 0.013768
9 0.772827 0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
+-------------------+--------------------+
Yes, when reading from a file, Spark maintains the order of records, but once a shuffle occurs the order is no longer preserved. So to preserve the order, you either need to write your job so that no shuffling of the data occurs, or you add an explicit sequence column and sort by it after reading, as in the sketch below.
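A minimal sketch of the second approach, assuming the same df and sqlContext as in the question; the column name row_idx and the output path test_ordered.parquet are illustrative choices, not anything from the original session:

from pyspark.sql.functions import monotonically_increasing_id

# Attach an explicit ordering column before any shuffle happens.
# monotonically_increasing_id() produces unique, increasing IDs that follow
# the current partition order, so it captures the row order at the time it is added.
df_with_idx = df.withColumn("row_idx", monotonically_increasing_id())
df_with_idx.write.parquet('test_ordered.parquet')

# Restore the original order after reading, then drop the helper column.
df2 = sqlContext.read.parquet('test_ordered.parquet')
df2 = df2.orderBy("row_idx").drop("row_idx")
df2.show()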
PARQUET-188 suggests that column ordering is not part of the Parquet spec, which confirms that column order is not honored by Parquet.
You can use either of the built-in sort() or orderBy() functions to sort a DataFrame in ascending or descending order on one or more columns, for example as shown below.
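A hedged sketch reusing the df2 read back in the question; the column named "0" comes from the example DataFrame:

from pyspark.sql.functions import col

df2.sort(col("0").asc()).show()      # ascending order on column "0"
df2.orderBy(col("0").desc()).show()  # descending order on column "0"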
Parquet files are composed of a header, row groups, and a footer. Within each row group, the values of a column are stored together. This columnar layout is well optimized both for fast query performance and for low I/O (minimizing the amount of data scanned).
This looks like it's the result of partitioning within Spark (as well as the implementation of show()). The show() function essentially wraps some pretty formatting around a call to take(), and there is a good explanation of how take() works here. Since the partitions that get read first may not be the same across the two calls to show(), you will see different output.
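A small sketch illustrating this, assuming the same test.parquet written in the question: the read typically produces several partitions (one per part file), and show() is just take() plus formatting.

df2 = sqlContext.read.parquet('test.parquet')
print(df2.rdd.getNumPartitions())  # usually more than one partition after a parquet read
print(df2.take(3))                 # take() pulls rows from whichever partitions are scanned first
df2.show(3)                        # same mechanism, with pretty printing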