I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?
In my current use case, I have a list of addresses that I want to normalize. For example, this DataFrame:

id | address
---|------------
1  | 2 foo lane
2  | 10 bar lane
3  | 24 pants ln
Would become
id | address
---|------------
1  | 2 foo ln
2  | 10 bar ln
3  | 24 pants ln
You can use the PySpark SQL function regexp_replace() to replace a string/substring in a column value with another string. regexp_replace() uses Java regex syntax for matching; if the regex does not match a value, that value is returned unchanged. The example below replaces every occurrence of 'lane' in the address column with 'ln'.
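To make the matching semantics concrete: regexp_replace() replaces all matches within each value and leaves non-matching values untouched. As a rough analogy (regexp_replace uses Java regex, not Python's), Python's re.sub has the same replace-all behavior:

```python
import re

# All occurrences of the pattern are replaced within each string.
print(re.sub('lane', 'ln', '2 foo lane'))   # '2 foo ln'

# When nothing matches, the original string comes back untouched,
# not an empty string.
print(re.sub('lane', 'ln', '24 pants ln'))  # '24 pants ln'
```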
You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Because DataFrames are distributed, immutable collections, you can't change column values in place; instead, withColumn() (or any of the other approaches) returns a new DataFrame with the updated values.
(For literal, non-regex substitution there is also DataFrame.replace(). Its replacement value must be a bool, int, float, string, or None. If value is a list, it should be the same length and type as to_replace; if value is a scalar and to_replace is a sequence, value is used as the replacement for each item in to_replace. An optional subset parameter takes a list of column names to consider.)
regexp_replace() is a string function that replaces the part of a string (substring) matching a regular expression (regex) with another string on a DataFrame column. It returns an org.apache.spark.sql.Column.
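One caveat for address normalization like this: a bare 'lane' pattern also matches inside longer words. Anchoring with \b (word boundaries, supported by both Java and Python regexes) restricts the replacement to whole words. Illustrated again with Python's re as an analogy (the 'laneway' address is a made-up example):

```python
import re

# Whole-word match: 'lane' at the end of the address is replaced...
print(re.sub(r'\blane\b', 'ln', '2 foo lane'))    # '2 foo ln'

# ...but 'lane' embedded inside 'laneway' is left alone, because
# the character after it is a word character, so \b does not match.
print(re.sub(r'\blane\b', 'ln', '7 laneway dr'))  # '7 laneway dr'
```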
For Spark 1.5 or later, you can use the functions package:
from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))
Quick explanation:
withColumn() is called to add a column to the DataFrame (or replace it, if a column with that name already exists). regexp_replace() generates the new column by replacing all substrings that match the pattern.

For Scala:
import org.apache.spark.sql.functions.{col, regexp_replace}

// Here the pattern "\\*" strips literal '*' characters from the column.
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))