 

How do I order fields of my Row objects in Spark (Python)

I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically, but if I do the following, they are ordered alphabetically.

from pyspark.sql import Row

row = Row(foo=1, bar=2)

Then it creates an object like the following:

Row(bar=2, foo=1)

When I then create a DataFrame from this object, the column order is bar first, foo second, whereas I'd prefer it the other way around.

I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?
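For example, something along these lines seems to work (just a rough sketch to illustrate what I mean, assuming a SparkSession called spark; the "_1"/"_2" keys and the toDF call are only for illustration):

from pyspark.sql import Row

# "_1" sorts before "_2", so the alphabetical ordering happens to match
# the order I actually want; the real names are attached afterwards.
rows = [Row(_1=1, _2=2)]
df = spark.createDataFrame(rows).toDF("foo", "bar")  # columns: foo, bar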

asked Feb 11 '16 by rye


1 Answer

Spark >= 3.0

Field sorting has been removed with SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), with the exception of legacy mode, which is enabled by setting the following environment variable:

PYSPARK_ROW_FIELD_SORTING_ENABLED=true 
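For example, on Spark 3.x with Python >= 3.6 (a minimal sketch), the fields keep their declaration order:

from pyspark.sql import Row

row = Row(foo=1, bar=2)
print(row)             # Row(foo=1, bar=2)
print(row.__fields__)  # ['foo', 'bar']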

Spark < 3.0

But is there any way to prevent the Row object from ordering them?

There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.

Just use plain tuples:

rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

rdd.toDF(["foo", "bar"])

or createDataFrame:

from pyspark.sql.types import *

spark.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])

spark.createDataFrame(rdd, schema)

You can also use namedtuples:

from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])

Finally, you can reorder the columns with select:

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
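Whichever variant you choose, the resulting column order is easy to verify (a small check reusing the sc and Row objects from above):

df = sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
print(df.columns)  # ['foo', 'bar']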
answered Sep 28 '22 by zero323