I have a DataFrame. I need to calculate the max length of the string values in a column and print both the value and its length.
I have written the code below, but the output is only the max length, not its corresponding value. The question "How to get max length of string column from dataframe using scala?" helped me arrive at this query.
df.agg(max(length(col("city")))).show()
char_length(expr) - Returns the character length of string data or the number of bytes of binary data. The length of string data includes trailing spaces; the length of binary data includes binary zeros.
Spark SQL provides a length() function that takes a DataFrame column as a parameter and returns the number of characters (including trailing spaces) in a string. It can be used to filter() DataFrame rows by the length of a column. If the input column is binary, it returns the number of bytes.
In PySpark, the string length of a column is likewise obtained with the length() function.
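To see length() in action, here is a minimal sketch; the cities DataFrame and its values are made up for illustration, and a SparkSession named spark (as in spark-shell) is assumed to be in scope:

import org.apache.spark.sql.functions.{col, length}
import spark.implicits._

val cities = Seq("A", "AB", "ABC").toDF("city")

// length() counts characters, including trailing spaces
cities.select(col("city"), length(col("city")).as("city_length")).show()

// length() can also be used directly in a filter
cities.filter(length(col("city")) > 2).show()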
Use the row_number() window function ordered by length('city) descending, then keep only the row where row_number is 1 and add a length('city) column to the DataFrame.
Ex:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
  .toDF("city","num","country")

// order all rows by string length, longest first
val win = Window.orderBy(length('city).desc)

df.withColumn("str_len", length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn === 1)
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
(or)
In spark-sql:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *,length(city)str_len,row_number() over (order by length(city) desc)rn from lpl)q where q.rn=1")
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
Update:
Find min and max values:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn === 1 || 'rn1 === 1)
.show(false)
Result:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | //min value of string
|ABC |1 |US |3 |1 |3 | //max value of string
+----+---+-------+-------+---+---+
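If only the lengths themselves are needed, not the rows that carry them, a plain aggregation avoids the window entirely. A small sketch against the same df; the min_len/max_len aliases are just illustrative:

import org.apache.spark.sql.functions.{length, min, max}

// One-pass aggregation: shortest and longest city lengths (1 and 3 for the sample data)
df.agg(min(length('city)).as("min_len"), max(length('city)).as("max_len")).show(false)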
In case you have multiple rows sharing the same maximum length, the window-function solution won't work, since it keeps only the first row after ordering.
Another way is to create a new column with the length of the string, find its maximum, and filter the DataFrame on the obtained maximum value.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
.toDF("city","num","country")
val dfWithLength = df.withColumn("city_length", length($"city")).cache()
dfWithLength.show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| A| 1| US| 1|
| AB| 1| US| 2|
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()
dfWithLength.filter($"city_length" === maxValue).show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
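To print both the value and its length, as asked in the question, the filtered rows can be collected to the driver. A small follow-up sketch based on the same dfWithLength and maxValue:

// Collect the longest city value(s) and print "value -> length"
dfWithLength
  .filter($"city_length" === maxValue)
  .select("city", "city_length")
  .collect()
  .foreach { case Row(city: String, len: Int) => println(s"$city -> $len") }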
Finding the maximum string length of a string column with PySpark (note that, like the aggregation in the question, this returns only the maximum length, not the value itself):
from pyspark.sql.functions import length, col, max
df2 = df.withColumn("len_Description",length(col("Description"))).groupBy().max("len_Description")