
Filter based on another RDD in Spark

I would like to keep only the employees whose department ID is referenced in the second table.

Employee table
LastName    DepartmentID
Rafferty    31
Jones   33
Heisenberg  33
Robinson    34
Smith   34

Department table
DepartmentID
31  
33  

I have tried the following code which does not work:

employee = [['Rafferty',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
employee = sc.parallelize(employee)
department = sc.parallelize(department)
employee.filter(lambda e: e[1] in department).collect()

Py4JError: An error occurred while calling o344.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Any ideas? I am using Spark 1.1.0 with Python. However, I would accept a Scala or Python answer.

asked Oct 06 '14 by poiuytrez


2 Answers

In this case, what you want to achieve is to filter at each partition using the data contained in the department table. This would be the basic solution:

val dept = deptRdd.collect.toSet
val employeesWithValidDeptRdd = employeesRdd.filter{case (employee, d) => dept.contains(d)}
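
For reference, a minimal PySpark sketch of the same approach, using the employee and department RDDs from the question (the variable names below are my own, not from the original answer):

# collect the small department table to the driver as a set,
# then filter the employees locally on each partition
dept_ids = set(department.collect())
employees_with_valid_dept = employee.filter(lambda e: e[1] in dept_ids)
# expected: [['Rafferty', 31], ['Jones', 33], ['Heisenberg', 33]]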

If your department data is large, a broadcast variable will improve performance by delivering the data once to all the nodes, instead of serializing it with each task:

val deptBC = sc.broadcast(deptRdd.collect.toSet)
val employeesWithValidDeptRdd = employeesRdd.filter{case (employee, d) => deptBC.value.contains(d)}
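
The PySpark equivalent of the broadcast version would be along these lines (again a sketch against the question's RDDs, with my own variable names):

# ship the department set once per node via a broadcast variable
dept_bc = sc.broadcast(set(department.collect()))
employees_with_valid_dept = employee.filter(lambda e: e[1] in dept_bc.value)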

Although using join would work, it's a very expensive solution, as it requires a distributed shuffle of the data (byKey) to achieve the join. Given that the requirement is a simple filter, sending the department data to each partition (as shown above) will provide much better performance.

answered Sep 20 '22 by maasg


I finally implemented a solution using a join. I had to add a 0 value to the department to avoid an exception from Spark:

employee = [['Rafferty',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
# invert id and name to get id as the key
employee = sc.parallelize(employee).map(lambda e: (e[1],e[0]))
# add a 0 value to avoid an exception
department = sc.parallelize(department).map(lambda d: (d,0))

employee.join(department).map(lambda e: (e[1][0], e[0])).collect()

output: [('Jones', 33), ('Heisenberg', 33), ('Rafferty', 31)]
answered Sep 17 '22 by poiuytrez