
Replace more than one element in Pyspark

Tags: regex, pyspark

I want to remove parts of a string in PySpark, such as 'www.' and '.com', using regexp_replace. Is it possible to pass a list of elements to be replaced?

my_list = ['www.google.com', 'google.com','www.goole']
from pyspark.sql import Row
from pyspark.sql.functions import regexp_replace
df = sc.parallelize(my_list).map(lambda x: Row(url = x)).toDF()
df.withColumn('site', regexp_replace('url', 'www.', '')).show()

I want to replace both www. and .com in the example above.

Fisseha Berhane asked Sep 20 '25 05:09


1 Answer

Use a pipe | (OR) to combine the two patterns into a single regex pattern, www\.|\.com, which matches either www. or .com. Note that you need to escape the . to match it literally, since an unescaped . matches (almost) any character in a regex:

df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()
+--------------+------+
|           url|  site|
+--------------+------+
|www.google.com|google|
|    google.com|google|
|     www.goole| goole|
+--------------+------+
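
To start from an actual list, as the question asks, one option is to build the alternation pattern programmatically. The following is only a minimal sketch: build_pattern and to_remove are hypothetical names, and it uses Python's re module to escape each literal and to check the result locally. Spark evaluates the pattern with Java's regex engine, so the assumption here is that the escaped string behaves the same way there, which should hold for simple literals like these.

```python
import re

def build_pattern(literals):
    """Join literal substrings into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(s) for s in literals)

# Hypothetical list of substrings to strip from each URL.
to_remove = ["www.", ".com"]
pattern = build_pattern(to_remove)

# Local check with Python's own regex engine on the question's sample data.
urls = ["www.google.com", "google.com", "www.goole"]
cleaned = [re.sub(pattern, "", u) for u in urls]

print(pattern)
print(cleaned)

# In PySpark, the same string would be passed as the pattern argument:
# df.withColumn('site', regexp_replace('url', pattern, '')).show()
```

Escaping each element matters for the reason given above: without it, the . in 'www.' would match any character rather than a literal dot.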
Psidom answered Sep 22 '25 20:09