 

PySpark replace strings in Spark DataFrame column

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

id     address
1      2 foo lane
2      10 bar lane
3      24 pants ln

Would become

id     address
1      2 foo ln
2      10 bar ln
3      24 pants ln
asked May 04 '16 by Luke


People also ask

How do you replace a string in a DataFrame column in PySpark?

You can use the PySpark SQL function regexp_replace() to replace a column value's substring with another string. regexp_replace() uses Java regex for matching; substrings that match the pattern are replaced, and the string is returned unchanged if nothing matches. For example, it can replace the street-name abbreviation Rd with Road in an address column.
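As a minimal sketch of that Rd → Road substitution, Python's re module can mirror the pattern regexp_replace would use (the sample address is illustrative, not from the original question):

```python
import re

address = "123 Main Rd"
# \bRd\b matches "Rd" only as a whole word, so it won't fire
# inside longer tokens; Java regex in regexp_replace behaves the same
print(re.sub(r"\bRd\b", "Road", address))  # 123 Main Road
```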

How do I change the values in a column in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.

How do you replace a DataFrame in PySpark?

This refers to DataFrame.replace(to_replace, value, subset). The replacement value must be a bool, int, float, string, or None. If value is a list, it should be the same length and type as to_replace. If value is a scalar and to_replace is a sequence, value is used as the replacement for each item in to_replace. subset is an optional list of column names to consider.

What is regexp_replace in PySpark?

regexp_replace is a string function used to replace part of a string (substring) value with another string in a DataFrame column by using a regular expression (regex). It returns an org.apache.spark.sql.Column.


2 Answers

For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • withColumn adds a column to the DataFrame, or replaces it if a column with that name already exists.
  • regexp_replace generates the new column by replacing every substring that matches the pattern.
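To sanity-check the replacement logic outside Spark, the same substitution can be sketched with Python's re module on the question's sample rows (using a word-boundary pattern, an assumption that keeps "lane" from matching inside other words):

```python
import re

addresses = ["2 foo lane", "10 bar lane", "24 pants ln"]
# \blane\b replaces "lane" only as a whole word;
# rows already ending in "ln" are left unchanged
normalized = [re.sub(r"\blane\b", "ln", a) for a in addresses]
print(normalized)  # ['2 foo ln', '10 bar ln', '24 pants ln']
```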
answered Sep 21 '22 by Daniel de Paula


For Scala:

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col

data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
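Note that the `"\\*"` pattern escapes the regex metacharacter `*`, so this strips literal asterisks rather than treating `*` as a quantifier. A quick check of the same pattern with Python's re module (the sample string is illustrative):

```python
import re

# r"\*" escapes '*', matching a literal asterisk, just like "\\*" in Scala
print(re.sub(r"\*", "", "24 pants ln*"))  # 24 pants ln
```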
answered Sep 23 '22 by loneStar