I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are ordered alphabetically:
row = Row(foo=1, bar=2)
Then it creates an object like the following:
Row(bar=2, foo=1)
When I then create a DataFrame from this object, the column order is going to be bar first, foo second, when I'd prefer it the other way around.
I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?
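Here's a minimal sketch of that workaround (renaming via DataFrame.toDF is just one way to reassign the names; it assumes an active SparkSession named spark):
from pyspark.sql import Row

# "_1" sorts before "_2", so the values keep the intended positions
row = Row(_1=1, _2=2)
# Rename the columns to the real names afterwards
df = spark.createDataFrame([row]).toDF("foo", "bar")
df.columns  # ['foo', 'bar']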
As far as I understand, given the column-based storage of Spark DataFrames, the order of the columns doesn't really have any meaning; they're like keys in a dictionary.
Spark >= 3.0
Field sorting has been removed with SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), with the exception of legacy mode, which is enabled when the following environment variable is set:
PYSPARK_ROW_FIELD_SORTING_ENABLED=true
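The Spark 3.0 migration guide notes this variable should be consistent on both driver and executors. A minimal sketch (setting it before importing pyspark is an assumption about when the flag is read):
import os

# Assumption: set before pyspark is imported so the legacy flag is picked up
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import Row
Row(foo=1, bar=2)  # with legacy sorting enabled: Row(bar=2, foo=1)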
Spark < 3.0
But is there any way to prevent the Row object from ordering them?
There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.
Just use plain tuples:
rdd = sc.parallelize([(1, 2)])
and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):
rdd.toDF(["foo", "bar"])
or createDataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType

# With column names only (types are inferred)
spark.createDataFrame(rdd, ["foo", "bar"])

# With a full schema (explicit types and nullability)
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])
spark.createDataFrame(rdd, schema)
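You can verify that the resulting schema matches, for example:
df = spark.createDataFrame(rdd, schema)
df.printSchema()
# root
#  |-- foo: integer (nullable = false)
#  |-- bar: integer (nullable = false)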
You can also use namedtuples:
from collections import namedtuple
FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
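Since namedtuples preserve their declared field order, the resulting columns come out as foo, bar with no extra schema needed.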
Finally, you can reorder the columns with select:
from pyspark.sql import Row

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
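select returns a new DataFrame with the columns in the order you list them, so this works even though the Row fields themselves were stored in alphabetical order.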