
Spark Scala filter DataFrame where value not in another DataFrame

I have two DataFrames, a and b. This is what they look like:

a
-------
v1 string
v2 string

roughly hundreds of millions of rows


b
-------
v2 string

roughly tens of millions of rows

I would like to keep rows from DataFrame a where v2 is not in b("v2").

I know I could use a left join and filter where the right side is null, or Spark SQL with a "not in" construction. I bet there is a better approach, though.
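For reference, a minimal Scala sketch of that left-join-and-filter approach, assuming the schemas shown above (the rename avoids an ambiguous v2 reference after the join):

    import org.apache.spark.sql.functions.col

    // Keep rows of a whose v2 has no match in b: left outer join on the key,
    // then retain only the rows where the right side failed to match.
    val result = a
      .join(b.withColumnRenamed("v2", "b_v2"), col("v2") === col("b_v2"), "left_outer")
      .filter(col("b_v2").isNull)
      .drop("b_v2")

In Spark 2.0 and later, a left anti join expresses the same thing directly: a.join(b, Seq("v2"), "left_anti").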

asked Feb 14 '16 by devopslife

People also ask

How do you filter records from a DataFrame in Spark?

The Spark where() function filters rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. where() can be used instead of filter() by users coming from a SQL background; the two operate identically.
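A minimal Scala sketch of that equivalence, assuming a hypothetical df with an age column:

    import org.apache.spark.sql.functions.col

    // filter() and where() are interchangeable; both accept a Column
    // condition or a SQL expression string.
    val adults1 = df.filter(col("age") >= 18)
    val adults2 = df.where(col("age") >= 18)
    val adults3 = df.where("age >= 18")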

How do I exclude a column from a DataFrame in Spark?

The Spark DataFrame provides the drop() method to remove a column or field from a DataFrame or Dataset. drop() can also remove multiple columns at once.
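A short sketch, assuming a hypothetical df with temp and debug columns:

    // Drop a single column by name; dropping a name that does not
    // exist is a no-op rather than an error.
    val trimmed = df.drop("temp")

    // Drop several columns at once (Spark 2.0+ varargs overload).
    val trimmedMore = df.drop("temp", "debug")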

How do you find the difference between two data frames in Spark?

Pretty simple: use except() to subtract one DataFrame from another, i.e. keep the rows that appear in the first but not in the second.
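A minimal sketch, assuming a SparkSession named spark is in scope:

    import spark.implicits._  // for toDF on local Seqs

    val df1 = Seq("a", "b", "c").toDF("v")
    val df2 = Seq("b", "c").toDF("v")

    // except() keeps rows of the left side that do not appear in the
    // right; both sides must have the same schema.
    val diff = df1.except(df2)  // contains only the row "a"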

How do you filter multiple conditions in PySpark DataFrame?

PySpark Filter with Multiple Conditions In PySpark, to filter() rows of a DataFrame on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend this with OR (|) and NOT (~) expressions as needed.
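That snippet describes PySpark, but the same pattern in Scala (the language of this question) looks like the following sketch; df and the state and gender columns are hypothetical:

    import org.apache.spark.sql.functions.col

    // Combine conditions with && (AND), || (OR), and ! (NOT);
    // each comparison should be parenthesized when operators mix.
    val filtered = df.filter(col("state") === "OH" && col("gender") === "M")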


1 Answer

You can achieve that using the except method of Dataset, which "returns a new Dataset containing rows in this Dataset but not in another Dataset".
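A minimal sketch of how except could be applied to the schemas in the question. Because except requires both sides to share a schema and a carries the extra column v1, the key column is isolated first and then joined back (an untested sketch, not the answerer's exact code):

    // Distinct v2 values present in a but absent from b.
    val keysToKeep = a.select("v2").except(b.select("v2"))

    // Inner join back to a to recover the full rows (v1, v2).
    val result = a.join(keysToKeep, Seq("v2"))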

answered Oct 21 '22 by Javier Alba