How to access an individual element of a tuple in an RDD in PySpark?

Let's say I have an RDD like this:

[(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)), (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))]

I am using map to process one tuple at a time, but how can I access an individual element of a tuple, for example to check whether it contains certain characters? What I actually want is to filter the RDD down to the tuples that contain those characters; here, the tuples that contain ABC.

I tried something like this, but it does not work:

def foo(line):
    if line[1] == "ABC":
        return line


new_data = data.map(foo)

I am new to both Spark and Python, so any help is appreciated.

Alibh asked Apr 14 '16 17:04



1 Answer

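RDDs can be filtered directly, so map is not needed here. Each record has the shape (key, (name, number)), so x[1] is the inner tuple and x[1][0] is its first element. The line below keeps every record whose inner tuple has "ABC" at index 0.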

new_data = data.filter(lambda x: x[1][0] == "ABC")
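
To see why the map attempt in the question did not help, here is a minimal sketch that mirrors the same logic with plain Python lists, so it runs without a Spark installation. The built-in map and filter stand in for the RDD methods of the same name, and the variable names (mapped, kept, contains_b, dropped) are made up for illustration. On a real RDD the same lambdas should carry over to data.filter(...), with .collect() to bring the results back to the driver.

```python
# Plain-Python stand-in for the RDD contents shown in the question.
data = [(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)),
        (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))]


def foo(line):
    # line[1] is the whole inner tuple, e.g. (u'ABC', 9989),
    # so comparing it to the string "ABC" is never true.
    if line[1] == "ABC":
        return line
    # Falling through returns None implicitly.


# map emits exactly one output per input record, so it cannot drop records;
# every record that fails the test becomes None.
mapped = list(map(foo, data))

# filter keeps only the records for which the predicate is true.
# x[1] is the inner tuple, x[1][0] is its first element.
kept = list(filter(lambda x: x[1][0] == "ABC", data))

# Substring test: keep records whose name contains the character "B".
contains_b = list(filter(lambda x: "B" in x[1][0], data))

# To exclude the "ABC" records instead of keeping them, negate the predicate.
dropped = list(filter(lambda x: x[1][0] != "ABC", data))

print(mapped)
print(kept)
print(contains_b)
print(dropped)
```

The two problems in the original attempt are independent: the comparison targets the inner tuple rather than its first element, and map always returns one value per input, so even a corrected test would leave None entries behind.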
David answered Sep 28 '22 09:09
