I am trying to create a new DataFrame column (b) by removing the last character from column (a). Column a is a string of varying length, so I am trying the following code:
from pyspark.sql.functions import *
df.select(substring('a', 1, length('a') -1 ) ).show()
I get a TypeError: 'Column' object is not callable
It seems to be caused by combining multiple functions, but I can't understand why, since each of them works on its own.
If I hardcode the length, this works:
df.select(substring('a', 1, 10 ) ).show()
And if I use length on its own, it also works:
df.select(length('a') ).show()
Why can't I combine multiple functions? Is there an easier way to remove the last character from every row in a column?
Use substr (positions are 1-based, and both arguments must be of the same type, hence the lit):
df.select(col('a').substr(lit(1), length(col('a')) - 1))
or using regexp_extract:
df.select(regexp_extract(col('a'), '(.*).$', 1))
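To see what the pattern captures, here is a small sketch using Python's own re module rather than Spark (Spark uses Java regular expressions, but this simple pattern should behave the same way in both). The helper name drop_last_char is made up for illustration. The greedy group (.*) takes as much as it can while still leaving one final character for the trailing dot, so group 1 is everything except the last character.

```python
import re

# Same pattern as in the regexp_extract call above: group 1 is greedy,
# and the final "." consumes exactly one character before the end.
PATTERN = re.compile(r"(.*).$")


def drop_last_char(text):
    """Return text without its last character (hypothetical helper)."""
    match = PATTERN.search(text)
    # regexp_extract yields an empty string when nothing matches,
    # so this sketch mirrors that instead of returning None.
    return match.group(1) if match else ""


print(drop_last_char("hello"))
```

An empty input has no character for the dot to consume, so nothing matches and the result is an empty string.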
The substring function does not work here because its pos and len parameters need to be integers, not columns:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
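Another option, not part of the answer above, is to write the whole expression in Spark SQL and pass it through expr, since SQL's substring accepts expressions for its arguments. The sketch below also uses withColumn so the result lands in a new column b, as the question asks. The helper names drop_last_char_sql and add_trimmed_column are hypothetical, and the code assumes the column name contains no backticks.

```python
def drop_last_char_sql(column_name):
    """Build a Spark SQL expression that drops the last character.

    Hypothetical helper; assumes column_name contains no backticks.
    """
    quoted = "`" + column_name + "`"
    return "substring({0}, 1, length({0}) - 1)".format(quoted)


def add_trimmed_column(df, source="a", target="b"):
    """Return df with a new column holding source minus its last character."""
    # Imported here so the string helper above can be used without Spark.
    from pyspark.sql import functions as F

    # expr parses the SQL text, so length(...) - 1 is evaluated per row.
    return df.withColumn(target, F.expr(drop_last_char_sql(source)))
```

With the DataFrame from the question, add_trimmed_column(df).show() should display a next to the new column b.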