 

Spark DataFrame filtering: retain rows belonging to a list

I am using Spark 1.5.1 with Scala in a Zeppelin notebook.

  • I have a DataFrame with a column called userID of type Long.
  • In total I have about 4 million rows and 200,000 unique userIDs.
  • I also have a list of 50,000 userIDs to exclude.
  • I can easily build the list of userIDs to retain.

What is the best way to delete all the rows that belong to the users to exclude?

Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?

I saw this post and applied its solution (see the code below), but the execution is slow, even though I am running Spark 1.5.1 on my local machine with 16 GB of RAM and the initial DataFrame fits in memory.

Here is the code that I am applying:

import org.apache.spark.sql.functions.lit

// Turn each ID into a literal Column and test membership with in()
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))

In the code above:

  • the initialDataFrame has 3,885,068 rows; each row has 5 columns, one of which is called userID and contains Long values.
  • The listOfUsersToKeep is an Array[Long] containing 150,000 userIDs.

I wonder if there is a more efficient solution than the one I am using.

Thanks

asked Nov 20 '15 by Rami

People also ask

How to filter a DataFrame based on multiple conditions in Spark?

If you are coming from a SQL background, you can apply that knowledge in Spark and filter DataFrame rows with SQL expressions. To filter rows on multiple conditions, you can use either a Column with a condition or a SQL expression.
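
For instance, here is a minimal sketch; the DataFrame df and its age and state columns are hypothetical:

import org.apache.spark.sql.functions.col

// Column-based conditions, combined with && (and) / || (or)
val byColumns = df.filter(col("age") > 21 && col("state") === "OH")

// The same filter expressed as a SQL string
val bySqlExpr = df.filter("age > 21 AND state = 'OH'")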

How do I filter on an array column in Spark?

Filter on an Array Column: when you want to filter rows from a DataFrame based on a value present in an array collection column, you can use the array_contains() Spark SQL function, which checks whether an array contains a value and returns true if present, false otherwise.
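
A sketch of this, assuming a hypothetical DataFrame df with an array column called languages (array_contains is available from Spark 1.5):

import org.apache.spark.sql.functions.array_contains

// Keep rows whose languages array contains the value "Scala"
val scalaRows = df.filter(array_contains($"languages", "Scala"))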

How do I filter a column in a DataFrame?

DataFrame filter() with a Column condition: use a Column with a condition to filter rows from a DataFrame. This lets you express complex conditions by referring to column names with col(name), $"colname", or dfObject("colname"), and it is the approach most commonly used when working with DataFrames. Use === for equality comparison.
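
The three column-reference styles side by side, on a hypothetical df with a state column:

import org.apache.spark.sql.functions.col

// All three lines are equivalent; note === (not ==) for Column equality
df.filter(col("state") === "OH")
df.filter($"state" === "OH")
df.filter(df("state") === "OH")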

What is the difference between where() and filter() in Spark?

Spark's filter() and where() functions filter rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use the where() operator instead of filter() if you are coming from a SQL background; the two functions behave exactly the same.
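
For example, assuming a DataFrame df with a Long userID column, the two calls below are interchangeable:

// where() is simply an alias for filter(); both produce the same result
val a = df.filter($"userID" > 100L)
val b = df.where($"userID" > 100L)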


1 Answer

You can either use a join:

// Build a single-column DataFrame from the IDs to keep
// (toDF needs import sqlContext.implicits._ in Spark 1.5)
val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

// Inner join keeps only rows whose userID appears in usersToKeep,
// then the helper userID_ column is dropped
val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")
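
Renaming the key column to userID_ avoids an ambiguous column reference after the join. Note that a plain join will shuffle both DataFrames by the join key unless Spark decides to broadcast the smaller side.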

or a broadcast variable and a UDF:

import org.apache.spark.sql.functions.udf

// Ship the IDs to every executor once, as a Set for fast membership tests
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
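
Because the set of IDs is broadcast once to each executor, this variant filters initialDataFrame with no shuffle at all; a Set of 150,000 Long values is small enough to broadcast comfortably.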

It should also be possible to broadcast a DataFrame:

import org.apache.spark.sql.functions.broadcast

// The broadcast() hint asks the optimizer to use a broadcast join
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
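
With the hint, usersToKeep is replicated to every executor instead of shuffling initialDataFrame; as in the first variant, the helper userID_ column can then be dropped from the result.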
answered by zero323