I have a DataFrame df in PySpark, like the one shown below -
+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC  Limited        |U.K.   |
|16   |Sons & Sons         |U.K.   |
|51   |TÜV GmbH            |Germany|
|23   |Mueller GmbH        |Germany|
|97   |Schneider AG        |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
I would like to keep only those rows where ID starts with either 5 or 6. So, I want my final DataFrame to look like this -
+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC  Limited        |U.K.   |
|51   |TÜV GmbH            |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
This can be achieved in many ways, and that's not the problem. But I am interested in learning how this can be done using a LIKE statement.
Had I only been interested in rows where ID starts with 5, it could have been done easily like this -
df=df.where("ID like ('5%')")
My question: How can I add the second condition, "ID like ('6%')", with an OR - | boolean inside the where clause? I want to do something like the code shown below, but it gives an error. So, in a nutshell, how can I combine multiple boolean LIKE conditions inside .where here -
df=df.where("(ID like ('5%')) | (ID like ('6%'))")
                This works for me
from pyspark.sql import functions as F
df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))