When I save a Spark DataFrame as a Parquet file and then read it back, the rows of the resulting DataFrame are not in the same order as the original, as shown in the session below. Is this a "feature" of DataFrames or of Parquet files? What would be the best way to save a DataFrame in a row-order-preserving manner?
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
0 1
0 0.191519 0.622109
1 0.437728 0.785359
2 0.779976 0.272593
3 0.276464 0.801872
4 0.958139 0.875933
5 0.357817 0.500995
6 0.683463 0.712702
7 0.370251 0.561196
8 0.503083 0.013768
9 0.772827 0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
+-------------------+--------------------+
Yes, when reading from a file, Spark maintains the order of records, but once a shuffle occurs the order is no longer preserved. So to preserve the order, you either need to write your job so that no shuffling of the data occurs, or you add an explicit sequence column and sort by it after reading, as in the sketch below.
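A minimal sketch of the second approach, assuming the same df and sqlContext as in the question; the column name row_idx and the output path test_ordered.parquet are illustrative choices, not anything from the original session:

from pyspark.sql.functions import monotonically_increasing_id

# Attach an explicit ordering column before any shuffle happens.
# monotonically_increasing_id() produces unique, increasing IDs that follow
# the current partition order, so it captures the row order at the time it is added.
df_with_idx = df.withColumn("row_idx", monotonically_increasing_id())
df_with_idx.write.parquet('test_ordered.parquet')

# Restore the original order after reading, then drop the helper column.
df2 = sqlContext.read.parquet('test_ordered.parquet')
df2 = df2.orderBy("row_idx").drop("row_idx")
df2.show()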
PARQUET-188 suggests that column ordering is not part of the Parquet spec, which confirms that column order is not honored by Parquet.
You can use either of the built-in sort() or orderBy() functions to sort a DataFrame in ascending or descending order on one or more columns, for example as shown below.
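A hedged sketch reusing the df2 read back in the question; the column named "0" comes from the example DataFrame:

from pyspark.sql.functions import col

df2.sort(col("0").asc()).show()      # ascending order on column "0"
df2.orderBy(col("0").desc()).show()  # descending order on column "0"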
Parquet files are composed of a header, row groups, and a footer. Within each row group, the values of a column are stored together. This columnar layout is well optimized both for fast query performance and for low I/O (minimizing the amount of data scanned).
This looks like it's the result of partitioning within Spark (as well as the implementation of show()). The show() function essentially wraps some pretty formatting around a call to take(), and there is a good explanation of how take() works here. Since the partitions that get read first may not be the same across the two calls to show(), you will see different output.
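A small sketch illustrating this, assuming the same test.parquet written in the question: the read typically produces several partitions (one per part file), and show() is just take() plus formatting.

df2 = sqlContext.read.parquet('test.parquet')
print(df2.rdd.getNumPartitions())  # usually more than one partition after a parquet read
print(df2.take(3))                 # take() pulls rows from whichever partitions are scanned first
df2.show(3)                        # same mechanism, with pretty printing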