How to find the max String length of a column in Spark using dataframe?

I have a DataFrame. I need to find the maximum length of the string values in a column and print both the value and its length.

I have written the code below, but it outputs only the max length, not the corresponding value. The question How to get max length of string column from dataframe using scala? helped me arrive at this query.

 df.agg(max(length(col("city")))).show()
asked May 11 '19 by Shashank V C


People also ask

How do I find the maximum length of a column in spark SQL?

Use the row_number() window function, ordered by length('city) descending. Then add a length('city) column to the DataFrame and filter for the row where row_number is 1 (see the accepted answer below for a full example).

How do I find the length of a string in a spark data frame?

char_length(expr) returns the character length of string data or the number of bytes of binary data. The length of string data includes trailing spaces; the length of binary data includes binary zeros.
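For instance, a quick check from the SQL interface (a minimal sketch, assuming a SparkSession named spark):

spark.sql("SELECT char_length('ABC  ') AS len").show()
// returns 5: the two trailing spaces are counted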

How do I get the length of a column in spark DataFrame?

Spark SQL provides a length() function that takes the DataFrame column type as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used to filter() the DataFrame rows by the length of a column. If the input column is Binary, it returns the number of bytes.
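For example, to keep only the rows whose city value is at least 3 characters long (a minimal sketch, assuming a DataFrame df with a string column city):

import org.apache.spark.sql.functions.{col, length}

// filter rows by the character length of a string column
df.filter(length(col("city")) >= 3).show()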

How do you find the length of a string column in PySpark?

In order to get the string length of a column in PySpark, use the length() function.


3 Answers

Use the row_number() window function, ordered by length('city) descending.

Then add a length('city) column to the DataFrame and keep only the row with row_number 1.

Ex:

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
       .toDF("city","num","country")

val win=Window.orderBy(length('city).desc)

df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn===1)
  .show(false)

+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1  |US     |3      |1  |
+----+---+-------+-------+---+

(or)

In spark-sql:

df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *, length(city) str_len, row_number() over (order by length(city) desc) rn from lpl) q where q.rn = 1")
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC|  1|     US|      3|  1|
+----+---+-------+-------+---+

Update:

Find the min and max string-length values:

val win_desc = Window.orderBy(length('city).desc)
val win_asc  = Window.orderBy(length('city).asc)

df.withColumn("str_len", length('city))
  .withColumn("rn",  row_number().over(win_desc))  // rank longest first
  .withColumn("rn1", row_number().over(win_asc))   // rank shortest first
  .filter('rn === 1 || 'rn1 === 1)                 // keep longest and shortest rows
  .show(false)

Result:

+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A   |1  |US     |1      |3  |1  | // shortest string
|ABC |1  |US     |3      |1  |3  | // longest string
+----+---+-------+-------+---+---+
answered Oct 13 '22 by notNull


In case multiple rows share the same maximum length, the window-function solution returns only one of them, since it keeps just the first row after ordering.

Another way is to create a new column with the length of the string, find its maximum, and filter the DataFrame by the obtained maximum value.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
       .toDF("city","num","country")

val dfWithLength = df.withColumn("city_length", length($"city")).cache()

dfWithLength.show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
|   A|  1|     US|          1|
|  AB|  1|     US|          2|
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

// extract the max length by pattern matching on the single result Row
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()

dfWithLength.filter($"city_length" === maxValue).show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

answered Oct 13 '22 by sanyi14ka


Find the maximum string length of a string column with PySpark:

from pyspark.sql.functions import length, col

# add a length column, then aggregate to get the max length
df2 = df.withColumn("len_Description", length(col("Description"))) \
        .groupBy().max("len_Description")
answered Oct 13 '22 by Dac Toan Ho