Hopefully I'm stupid and this will be easy.
I have a dataframe containing the columns 'url' and 'referrer'.
I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.
I can use
val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))
However, this pulls out the url www.google.co.uk search url that also contains my web domain for some reason. Is there a way, using scala in spark, that I can filter out anything with google in it while keeping the correct results I have?
Thanks
Dean
You can negate predicate using either not or ! so all what's left is to add another condition:
import org.apache.spark.sql.functions.not
df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))
or separate filter:
df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With