
Sorting numeric String in Spark Dataset

Let's assume that I have the following Dataset:

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|      XX-13|       300|
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|     XX-100|       870|
+-----------+----------+

where productCode is a String and amount is an Int.

If you order this by productCode, the result will be the following (which is expected, given the nature of String comparison):

def orderProducts(product: Dataset[Product]): Dataset[Product] = {
    product.orderBy("productCode")
}

// Output:
+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|      XX-10|        35|
|     XX-100|       870|
|      XX-13|       300|
|       XX-2|       410|
|       XX-9|        50|
+-----------+----------+

How can I get the output ordered by the integer part of productCode, as shown below, using the Dataset API?

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|      XX-13|       300|
|     XX-100|       870|
+-----------+----------+
asked Mar 04 '23 by J-Alex

1 Answer

Use an expression in the orderBy. Check this out:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]

scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
|       XX-1|250|
|       XX-2|410|
|       XX-9| 50|
|      XX-10| 35|
|      XX-13|300|
|     XX-100|870|
+-----------+---+


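Since the question is about the typed Dataset API: orderBy with a Column on a Dataset[T] still returns a Dataset[T], so the same expression can be used there too. A minimal sketch, assuming a case class Product(productCode: String, amount: Int) matching the question's schema:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.split

// Assumed case class mirroring the question's Dataset[Product]
case class Product(productCode: String, amount: Int)

def orderProducts(product: Dataset[Product]): Dataset[Product] = {
  // Split "XX-13" on "-", take the numeric part, cast it to int and sort on it.
  // orderBy(Column) keeps the typed Dataset[Product].
  product.orderBy(split(product("productCode"), "-")(1).cast("int"))
}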

With window functions, you could do it like this:

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1       |250|1   |
|XX-2       |410|2   |
|XX-9       |50 |3   |
|XX-10      |35 |4   |
|XX-13      |300|5   |
|XX-100     |870|6   |
+-----------+---+----+



Note that Spark warns about moving all the data to a single partition, since no partitioning is defined for the window.
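If you do need the window but want to avoid the single-partition shuffle, one option is to partition the window by the code prefix (the "XX" part), assuming your real data has many distinct prefixes; row numbers then restart within each prefix, and with only one prefix, as in the sample data, it doesn't help. A sketch under that assumption, reusing the imports from above:

// Partition the window by the code prefix so the sort runs per prefix
// instead of pulling everything onto one partition.
// Note: row_number() restarts at 1 within each prefix.
df.withColumn("row1",
  row_number().over(
    Window.partitionBy(split('productCode, "-")(0))
      .orderBy(split('productCode, "-")(1).cast("int"))))
  .show(false)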

answered Mar 15 '23 by stack0114106