I am trying to use the length function inside a substring function in a DataFrame
but it gives error
val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))
below is the error
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
I am using 2.1.
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Spark SQL provides a length() function that takes the DataFrame column type as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used to filter() the DataFrame rows by the length of a column. If the input column is Binary, it returns the number of bytes.
Similar to Python Pandas you can get the Size and Shape of the PySpark (Spark with Python) DataFrame by running count() action to get the number of rows on DataFrame and len(df. columns()) to get the number of columns.
Function "expr" can be used:
val data = List("first", "second", "third")
val df = sparkContext.parallelize(data).toDF("value")
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)
output:
+------+------+
|value |cutted|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+
You could also use $"COLUMN".substr
val substrDF = testDF.withColumn("newcol", $"col".substr(lit(1), length($"col")-1))
Output:
val testDF = sc.parallelize(List("first", "second", "third")).toDF("col")
val result = testDF.withColumn("newcol", $"col".substr(org.apache.spark.sql.functions.lit(1), length($"col")-1))
result.show(false)
+------+------+
|col |newcol|
+------+------+
|first |firs |
|second|secon |
|third |thir |
+------+------+
You get that error because you the signature of substring
is
def substring(str: Column, pos: Int, len: Int): Column
The len
argument that you are passing is a Column
, and should be an Int
.
You may probably want to implement a simple UDF to solve that problem.
val strTail = udf((str: String) => str.substring(1))
testDF.withColumn("newCol", strTail($"col"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With