Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pyspark alter column with substring

Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a select number of characters from the start and end of string.

from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1] 

df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show() 

This is pretty close but slightly different Spark Dataframe column with last character of other column. And then there is this LEFT and RIGHT function in PySpark SQL

like image 837
citynorman Avatar asked Oct 14 '17 23:10

citynorman


People also ask

How do you change a column value in PySpark?

You can replace column values of PySpark DataFrame by using SQL string functions regexp_replace(), translate(), and overlay() with Python examples.

How do you update a column in a DataFrame in PySpark?

You can do update a PySpark DataFrame Column using withColum(), select() and sql(), since DataFrame's are distributed immutable collection you can't really change the column values however when you change the value using withColumn() or any approach, PySpark returns a new Dataframe with updated values.


2 Answers

pyspark.sql.functions.substring(str, pos, len)

Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type

In your code,

df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
1 is pos and -1 becomes len, length can't be -1 and so it returns null

Try this, (with fixed syntax)

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

udf1 = udf(lambda x:x[1:-1],StringType())
df.withColumn('COLUMN_NAME_fix',udf1('COLUMN_NAME')).show()
like image 89
Suresh Avatar answered Sep 28 '22 01:09

Suresh


try:

df.withColumn('COLUMN_NAME_fix', df['COLUMN_NAME'].substr(1, 10)).show()

where 1 = start position in the string and 10 = number of characters to include from start position (inclusive)

like image 22
Grant Shannon Avatar answered Sep 28 '22 03:09

Grant Shannon