 

PySpark replace strings in Spark DataFrame column

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

id     address
1      2 foo lane
2      10 bar lane
3      24 pants ln

Would become

id     address
1      2 foo ln
2      10 bar ln
3      24 pants ln
asked May 04 '16 by Luke


People also ask

How do you replace a string in a DataFrame column in PySpark?

You can use the PySpark SQL function regexp_replace() to replace a column value's substring with another string. regexp_replace() uses Java regex for matching; substrings that match the pattern are replaced, and the string is returned unchanged if nothing matches. For example, it can replace the street-name abbreviation Rd with Road in an address column.
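As a minimal sketch of that Rd → Road substitution, Python's re module can mirror the pattern regexp_replace would use (the sample address is illustrative, not from the original question):

```python
import re

address = "123 Main Rd"
# \bRd\b matches "Rd" only as a whole word, so it won't fire
# inside longer tokens; Java regex in regexp_replace behaves the same
print(re.sub(r"\bRd\b", "Road", address))  # 123 Main Road
```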

How do I change the values in a column in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.

How do you replace a DataFrame in PySpark?

This refers to DataFrame.replace(to_replace, value, subset). The replacement value must be a bool, int, float, string, or None. If value is a list, it should be the same length and type as to_replace. If value is a scalar and to_replace is a sequence, value is used as the replacement for each item in to_replace. subset is an optional list of column names to consider.

What is regexp_replace in PySpark?

regexp_replace is a string function used to replace part of a string (substring) value with another string in a DataFrame column by using a regular expression (regex). It returns an org.apache.spark.sql.Column.


2 Answers

For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • withColumn adds a column to the DataFrame, or replaces it if a column with that name already exists.
  • regexp_replace generates the new column by replacing every substring that matches the pattern.
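To sanity-check the replacement logic outside Spark, the same substitution can be sketched with Python's re module on the question's sample rows (using a word-boundary pattern, an assumption that keeps "lane" from matching inside other words):

```python
import re

addresses = ["2 foo lane", "10 bar lane", "24 pants ln"]
# \blane\b replaces "lane" only as a whole word;
# rows already ending in "ln" are left unchanged
normalized = [re.sub(r"\blane\b", "ln", a) for a in addresses]
print(normalized)  # ['2 foo ln', '10 bar ln', '24 pants ln']
```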
answered Sep 21 '22 by Daniel de Paula


For Scala:

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col

data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
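Note that the `"\\*"` pattern escapes the regex metacharacter `*`, so this strips literal asterisks rather than treating `*` as a quantifier. A quick check of the same pattern with Python's re module (the sample string is illustrative):

```python
import re

# r"\*" escapes '*', matching a literal asterisk, just like "\\*" in Scala
print(re.sub(r"\*", "", "24 pants ln*"))  # 24 pants ln
```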
answered Sep 23 '22 by loneStar