
use length function in substring in spark

I am trying to use the length function inside a substring function on a DataFrame column, but it gives an error:

val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))

below is the error

 error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Int

I am using Spark 2.1.

asked Sep 21 '17 by satish

People also ask

How do I get the length of a string in Spark SQL?

char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
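As a minimal sketch of this (assuming a spark-shell session, where a SparkSession named `spark` is already in scope):

```scala
// Assumes spark-shell, where a SparkSession named `spark` is in scope.
// char_length counts trailing spaces: "Spark  " (two trailing spaces) is 7 characters.
spark.sql("SELECT char_length('Spark  ') AS n").show()
// the single row has n = 7: 5 characters in "Spark" plus 2 trailing spaces
```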

How do you use length in PySpark?

Spark SQL provides a length() function that takes the DataFrame column type as a parameter and returns the number of characters (including trailing spaces) in a string. This function can be used to filter() the DataFrame rows by the length of a column. If the input column is Binary, it returns the number of bytes.
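A short sketch of filtering by length (again assuming a spark-shell session with implicits imported, and a hypothetical one-column DataFrame):

```scala
// Assumes spark-shell; `toDF` comes from spark.implicits._, which the shell imports.
import org.apache.spark.sql.functions.length

val df = Seq("first", "second", "third").toDF("value")

// Keep only rows whose string is longer than 5 characters
df.filter(length($"value") > 5).show(false)
// only "second" (6 characters) survives the filter
```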

How do I find the length of my Spark data frame?

Similar to Python Pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
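The Scala equivalent is the same idea (a sketch, assuming a DataFrame `df` like the ones in the answers below):

```scala
// count() is an action and triggers a job; columns is metadata only, no job runs.
val rows = df.count()
val cols = df.columns.length
println(s"shape: ($rows, $cols)")
```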


3 Answers

The expr function can be used to fall back to SQL syntax, where length() works inside substring():

import org.apache.spark.sql.functions.expr

val data = List("first", "second", "third")
val df = spark.sparkContext.parallelize(data).toDF("value")
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)

Output:

+------+------+
|value |cutted|
+------+------+
|first |firs  |
|second|secon |
|third |thir  |
+------+------+
answered Oct 10 '22 by pasha701


You could also use $"col".substr, which accepts Column arguments for both position and length:

val substrDF = testDF.withColumn("newcol", $"col".substr(lit(1), length($"col")-1))

Output:

val testDF = sc.parallelize(List("first", "second", "third")).toDF("col")
val result = testDF.withColumn("newcol", $"col".substr(org.apache.spark.sql.functions.lit(1), length($"col")-1))
result.show(false)
+------+------+
|col   |newcol|
+------+------+
|first |firs  |
|second|secon |
|third |thir  |
+------+------+
answered Oct 10 '22 by shabbir hussain


You get that error because the signature of substring is

def substring(str: Column, pos: Int, len: Int): Column 

The len argument that you are passing is a Column, but it should be an Int.

You could instead implement a simple UDF to solve the problem:

val dropLast = udf((str: String) => str.substring(0, str.length - 1))
testDF.withColumn("newCol", dropLast($"col"))
answered Oct 10 '22 by elghoto