Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a select number of characters from the start and end of string.
from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1]
df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show()
This is pretty close but slightly different Spark Dataframe column with last character of other column. And then there is this LEFT and RIGHT function in PySpark SQL
You can replace column values of PySpark DataFrame by using SQL string functions regexp_replace(), translate(), and overlay() with Python examples.
You can do update a PySpark DataFrame Column using withColum(), select() and sql(), since DataFrame's are distributed immutable collection you can't really change the column values however when you change the value using withColumn() or any approach, PySpark returns a new Dataframe with updated values.
pyspark.sql.functions.substring(str, pos, len)
Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type
In your code,
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
1 is pos and -1 becomes len, length can't be -1 and so it returns null
Try this, (with fixed syntax)
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
udf1 = udf(lambda x:x[1:-1],StringType())
df.withColumn('COLUMN_NAME_fix',udf1('COLUMN_NAME')).show()
try:
df.withColumn('COLUMN_NAME_fix', df['COLUMN_NAME'].substr(1, 10)).show()
where 1 = start position in the string and 10 = number of characters to include from start position (inclusive)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With