I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows from the largeDataFrame whose some_identifier column matches one of the rows in the smallDataFrame.
Here's an example:
largeDataFrame

some_identifier,first_name
111,bob
123,phil
222,mary
456,sue

smallDataFrame

some_identifier
123
456

desiredOutput

111,bob
222,mary
Here is my ugly solution.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))

val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame2), Seq("some_identifier"), "left")
  .filter($"is_bad".isNull)
  .drop("is_bad")
Is there a cleaner solution?
In Spark, the isin() function checks whether a DataFrame column value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator.
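As a sketch of that alternative, assuming the smallDataFrame is small enough (10,000 rows) to collect its identifiers to the driver, the filter could look like this. The session setup and example data here are hypothetical, mirroring the question's example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("isin-filter").getOrCreate()
import spark.implicits._

// Example data from the question.
val largeDataFrame = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))
  .toDF("some_identifier", "first_name")
val smallDataFrame = Seq(123, 456).toDF("some_identifier")

// Collect the identifiers to the driver, then keep only the rows whose
// identifier is NOT in that list.
val ids = smallDataFrame.as[Int].collect()
val result = largeDataFrame.filter(!$"some_identifier".isin(ids: _*))
result.show()
```

This only makes sense when the lookup side fits comfortably in driver memory; for larger lookup tables a join is the safer choice.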
Anti join: an anti join returns the rows from the left relation that have no match in the right. It is also referred to as a left anti join.
You'll need to use a left_anti join in this case.
The left anti join is the opposite of a left semi join: it removes from the left table the rows whose key appears in the right table:
largeDataFrame
  .join(smallDataFrame, Seq("some_identifier"), "left_anti")
  .show()

// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// |            222|      mary|
// |            111|       bob|
// +---------------+----------+
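Since the smallDataFrame has only 10,000 rows, you can also keep the broadcast hint from the original solution and combine it with left_anti, so the anti join runs as a broadcast join rather than a shuffle. A minimal self-contained sketch (session setup and data are hypothetical, mirroring the question's example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("anti-join").getOrCreate()
import spark.implicits._

// Example data from the question.
val largeDataFrame = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))
  .toDF("some_identifier", "first_name")
val smallDataFrame = Seq(123, 456).toDF("some_identifier")

// Broadcast the small side and anti-join: keeps rows of largeDataFrame
// whose some_identifier does NOT appear in smallDataFrame.
val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame), Seq("some_identifier"), "left_anti")
desiredOutput.show()
```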