Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Do parquet files preserve the row order of Spark DataFrames?

When I save a Spark DataFrame as a parquet file then read it back, the rows of the resulting DataFrame are not the same as the original as shown in the session below. Is this a "feature" of DataFrames or of parquet files? What would be the best way to save a DataFrame in a row-order preserving manner?

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
          0         1
0  0.191519  0.622109
1  0.437728  0.785359
2  0.779976  0.272593
3  0.276464  0.801872
4  0.958139  0.875933
5  0.357817  0.500995
6  0.683463  0.712702
7  0.370251  0.561196
8  0.503083  0.013768
9  0.772827  0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
+-------------------+--------------------+
like image 230
Christian Alis Avatar asked Oct 08 '15 15:10

Christian Alis


People also ask

Does Spark preserve row order?

Yes, when reading from file, Spark maintains the order of records. But when shuffling occurs, the order is not preserved. So in order to preserve the order, either you need to program so that no shuffling occurs in data or you create a seq.

Does Parquet preserve column order?

PARQUET-188 suggests that column ordering is not part of the parquet spec, so it confirms that the column order is not honored while Parquet.

How do you maintain order in Pyspark DataFrame?

You can use either sort() or orderBy() built-in functions to sort a particular DataFrame in ascending or descending order over at least one column.

How does Parquet format store data?

Parquet files are composed of row groups, header and footer. Each row group contains data from the same columns. The same columns are stored together in each row group: This structure is well-optimized both for fast query performance, as well as low I/O (minimizing the amount of data scanned).


1 Answers

This looks like it's the result of partitioning within Spark (as well as the implementation for show()). The function show() essentially wraps some pretty formatting around a call to take() and there is a good explanation as to how take works here. Since the initially read partitions may not be the same across both calls to show(), you will see different values.

like image 150
Rohan Aletty Avatar answered Oct 10 '22 06:10

Rohan Aletty