I have a PySpark DataFrame df, like the one shown below:
+-----+--------------------+-------+
| ID| customers|country|
+-----+--------------------+-------+
|56 |xyz Limited |U.K. |
|66 |ABC Limited |U.K. |
|16 |Sons & Sons |U.K. |
|51 |TÜV GmbH |Germany|
|23 |Mueller GmbH |Germany|
|97 |Schneider AG |Germany|
|69 |Sahm UG |Austria|
+-----+--------------------+-------+
I would like to keep only those rows where ID
starts with either 5 or 6. So I want my final DataFrame to look like this:
+-----+--------------------+-------+
| ID| customers|country|
+-----+--------------------+-------+
|56 |xyz Limited |U.K. |
|66 |ABC Limited |U.K. |
|51 |TÜV GmbH |Germany|
|69 |Sahm UG |Austria|
+-----+--------------------+-------+
This can be achieved in many ways, and that's not the problem. But I am interested in learning how it can be done using a LIKE
statement.
Had I only been interested in the rows where ID
starts with 5, it could have been done easily like this:
df = df.where("ID like ('5%')")
My question: How can I add a second condition, "ID like ('6%')",
combined with an OR (|)
boolean inside the where
clause? I tried the code shown below, but it gives an error. So, in a nutshell, how can I combine multiple boolean conditions using LIKE and .where
here?
df = df.where("(ID like ('5%')) | (ID like ('6%'))")
This works for me:
from pyspark.sql import functions as F
df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))