 

Using LIKE operator for multiple words in PySpark

I have a DataFrame df in PySpark, like the one shown below -

+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC  Limited        |U.K.   |
|16   |Sons & Sons         |U.K.   |
|51   |TÜV GmbH            |Germany|
|23   |Mueller GmbH        |Germany|
|97   |Schneider AG        |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+

I would like to keep only those rows where ID starts with either 5 or 6. So, I want my final DataFrame to look like this -

+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC  Limited        |U.K.   |
|51   |TÜV GmbH            |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+

This can be achieved in many ways and that's not a problem. But I am interested in learning how it can be done using the LIKE statement.

Had I only been interested in those rows where ID starts with 5, it could have been done easily like this -

df=df.where("ID like ('5%')")

My question: How can I add a second condition like "ID like ('6%')" with a boolean OR (|) inside the where clause? I want to do something like the code shown below, but it gives an error. So, in a nutshell, how can I combine multiple boolean conditions using LIKE and .where here -

df=df.where("(ID like ('5%')) | (ID like ('6%'))")
asked by cph_sto

1 Answer

This works for me:

from pyspark.sql import functions as F
df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))
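If you want to stay with the SQL-string form of .where, note that the string is parsed as Spark SQL, so the boolean keyword or (rather than the Python operator |) should also work. Below is a minimal, self-contained sketch; the sample data simply mirrors the question, and the rlike variant is an extra alternative, not the only way:

from pyspark.sql import SparkSession

# Assumed sample data mirroring the question's DataFrame
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("56", "xyz Limited", "U.K."),
     ("66", "ABC  Limited", "U.K."),
     ("16", "Sons & Sons", "U.K."),
     ("51", "TÜV GmbH", "Germany"),
     ("23", "Mueller GmbH", "Germany"),
     ("97", "Schneider AG", "Germany"),
     ("69", "Sahm UG", "Austria")],
    ["ID", "customers", "country"],
)

# SQL-expression form: use the SQL keyword `or`, not the Column operator `|`
df.where("ID like '5%' or ID like '6%'").show()

# Regex alternative: keep rows whose ID starts with 5 or 6
df.where(df["ID"].rlike("^[56]")).show()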
answered by Mike