I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are ordered alphabetically:
row = Row(foo=1, bar=2)
Then it creates an object like the following:
Row(bar=2, foo=1)
When I then create a DataFrame from this object, the column order is going to be bar first, foo second, when I'd prefer it the other way around.
I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?
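Here's a minimal sketch of that workaround (renaming via DataFrame.toDF is just one way to reassign the names; it assumes an active SparkSession named spark):
from pyspark.sql import Row

# "_1" sorts before "_2", so the values keep the intended positions
row = Row(_1=1, _2=2)
# Rename the columns to the real names afterwards
df = spark.createDataFrame([row]).toDF("foo", "bar")
df.columns  # ['foo', 'bar']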
As far as I understand, given the column-based storage of Spark DataFrames, the order of the columns doesn't really have any meaning; they're like keys in a dictionary.
Spark >= 3.0
Field sorting has been removed with SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), with the exception of legacy mode, which is enabled when the following environment variable is set:
PYSPARK_ROW_FIELD_SORTING_ENABLED=true
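The Spark 3.0 migration guide notes this variable should be consistent on both driver and executors. A minimal sketch (setting it before importing pyspark is an assumption about when the flag is read):
import os

# Assumption: set before pyspark is imported so the legacy flag is picked up
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

from pyspark.sql import Row
Row(foo=1, bar=2)  # with legacy sorting enabled: Row(bar=2, foo=1)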
Spark < 3.0
But is there any way to prevent the Row object from ordering them?
There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.
Just use plain tuples:
rdd = sc.parallelize([(1, 2)])
and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):
rdd.toDF(["foo", "bar"])
or createDataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType

# With column names only (types are inferred)
spark.createDataFrame(rdd, ["foo", "bar"])

# With a full schema (explicit types and nullability)
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])
spark.createDataFrame(rdd, schema)
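You can verify that the resulting schema matches, for example:
df = spark.createDataFrame(rdd, schema)
df.printSchema()
# root
#  |-- foo: integer (nullable = false)
#  |-- bar: integer (nullable = false)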
You can also use namedtuples:
from collections import namedtuple
FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
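Since namedtuples preserve their declared field order, the resulting columns come out as foo, bar with no extra schema needed.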
Finally, you can reorder the columns with select:
from pyspark.sql import Row

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
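select returns a new DataFrame with the columns in the order you list them, so this works even though the Row fields themselves were stored in alphabetical order.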