How to find the max String length of a column in Spark using dataframe?

I have a DataFrame. I need to find the maximum length of the string values in a column and print both the value and its length.

I have written the code below, but it outputs only the max length, not the corresponding value. The question How to get max length of string column from dataframe using scala? helped me arrive at this query.

 df.agg(max(length(col("city")))).show()
asked May 11 '19 by Shashank V C


People also ask

How do I find the maximum length of a column in spark SQL?

Use the row_number() window function, ordered by length('city) descending. Then add a length('city) column to the DataFrame and filter for the row where row_number is 1 (see the accepted answer below for a full example).

How do I find the length of a string in a spark data frame?

char_length(expr) returns the character length of string data or the number of bytes of binary data. The length of string data includes trailing spaces; the length of binary data includes binary zeros.
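For instance, a quick check from the SQL interface (a minimal sketch, assuming a SparkSession named spark):

spark.sql("SELECT char_length('ABC  ') AS len").show()
// returns 5: the two trailing spaces are counted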

How do I get the length of a column in spark DataFrame?

Spark SQL provides a length() function that takes the DataFrame column type as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used to filter() the DataFrame rows by the length of a column. If the input column is Binary, it returns the number of bytes.
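For example, to keep only the rows whose city value is at least 3 characters long (a minimal sketch, assuming a DataFrame df with a string column city):

import org.apache.spark.sql.functions.{col, length}

// filter rows by the character length of a string column
df.filter(length(col("city")) >= 3).show()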

How do you find the length of a string column in PySpark?

In order to get the string length of a column in PySpark, use the length() function.


3 Answers

Use the row_number() window function, ordered by length('city) descending.

Then add a length('city) column to the DataFrame and keep only the row with row_number 1.

Ex:

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
       .toDF("city","num","country")

val win=Window.orderBy(length('city).desc)

df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn===1)
  .show(false)

+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1  |US     |3      |1  |
+----+---+-------+-------+---+

(or)

In spark-sql:

df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *, length(city) str_len, row_number() over (order by length(city) desc) rn from lpl) q where q.rn = 1")
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC|  1|     US|      3|  1|
+----+---+-------+-------+---+

Update:

Find the min and max string-length values:

val win_desc = Window.orderBy(length('city).desc)
val win_asc  = Window.orderBy(length('city).asc)

df.withColumn("str_len", length('city))
  .withColumn("rn",  row_number().over(win_desc))  // rank longest first
  .withColumn("rn1", row_number().over(win_asc))   // rank shortest first
  .filter('rn === 1 || 'rn1 === 1)                 // keep longest and shortest rows
  .show(false)

Result:

+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A   |1  |US     |1      |3  |1  | // shortest string
|ABC |1  |US     |3      |1  |3  | // longest string
+----+---+-------+-------+---+---+
answered Oct 13 '22 by notNull


In case multiple rows share the same maximum length, the window-function solution returns only one of them, since it keeps just the first row after ordering.

Another way is to create a new column with the length of the string, find its maximum, and filter the DataFrame by the obtained maximum value.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
       .toDF("city","num","country")

val dfWithLength = df.withColumn("city_length", length($"city")).cache()

dfWithLength.show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
|   A|  1|     US|          1|
|  AB|  1|     US|          2|
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

// extract the max length by pattern matching on the single result Row
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()

dfWithLength.filter($"city_length" === maxValue).show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

answered Oct 13 '22 by sanyi14ka


Find the maximum string length of a string column with PySpark:

from pyspark.sql.functions import length, col

# add a length column, then aggregate to get the max length
df2 = df.withColumn("len_Description", length(col("Description"))) \
        .groupBy().max("len_Description")
answered Oct 13 '22 by Dac Toan Ho