
Changing Nulls Ordering in Spark SQL

I need to be able to sort columns in ascending and descending order and also allow nulls to be first or nulls to be last. Using RDDs I could use the sortByKey method with a custom comparator. I was wondering if there is a corresponding approach using the Dataset API. I see how to add desc/asc to columns, but I have no clue about the nulls ordering.

Asked by John Leach, Sep 08 '16.

People also ask

How do I change the NULL value in Spark?

In Spark, the fill() function of the DataFrameNaFunctions class is used to replace NULL values in a DataFrame column with zero (0), an empty string, a space, or any constant literal value.

How does SQL handle nulls in Spark?

In Spark, using the filter() or where() functions of DataFrame, we can filter out rows with NULL values by checking IS NULL or isNull. These remove all rows with null values in the state column and return a new DataFrame; all of the above examples return the same output.

How do you order NULL values?

If you specify the ORDER BY clause, NULL values are by default ordered as less than non-NULL values. With ASC order, a NULL value comes before any non-NULL value; with DESC order, the NULL comes last.

How do you replace nulls in PySpark?

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.


2 Answers

You can also do it with the Dataset API:

scala> val df = Seq("a", "b", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]

scala> df.select('*).orderBy('x.asc_nulls_last).show
+----+
|   x|
+----+
|   a|
|   b|
|null|
+----+


scala> df.select('*).orderBy('x.asc_nulls_first).show
+----+
|   x|
+----+
|null|
|   a|
|   b|
+----+

Same thing works with desc_nulls_last and desc_nulls_first.

Answered by Sandor Murakozi, Oct 21 '22.


As mentioned by Oleksandr, there was a pull request for this. Now you can optionally use "nulls first" or "nulls last":

scala> spark.sql("select * from spark_10747 order by col3 nulls last").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   6|   7|   4|
|   6|  11|   4|
|   6|  15|   8|
|   6|  15|   8|
|   6|   7|   8|
|   6|  12|  10|
|   6|   9|  10|
|   6|  13|null|
|   6|  10|null|
+----+----+----+
Answered by xmorera, Oct 21 '22.