
Is there a way to filter a field not containing something in a spark dataframe using scala?

Hopefully I'm stupid and this will be easy.

I have a dataframe containing the columns 'url' and 'referrer'.

I want to extract all the referrers that contain the domains 'www.mydomain.com' and 'mydomain.co'.

I can use

val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))

However, this also pulls out www.google.co.uk search URLs, which happen to contain my web domain as well. Is there a way, using Scala in Spark, to filter out anything with 'google' in it while keeping the correct results I have?

Thanks

Dean

Asked Nov 09 '15 by Dean

1 Answer

You can negate a predicate using either `not` or `!`, so all that's left is to add another condition:

import org.apache.spark.sql.functions.not

df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))

or separate filter:

df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
Answered Sep 30 '22 by zero323