 

PySpark - Convert an RDD into a key value pair RDD, with the values being in a List

I have an RDD whose elements are tuples of the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to transform it into a key-value pair RDD, where the key is the first string of each tuple and the value is a list of the remaining strings, i.e. I want to turn it into the form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
asked Jan 08 '23 by nikos


1 Answer

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])

>>> result = rdd.map(lambda x: (x[0], list(x[1:])))

>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
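
As a quick check that result now behaves like a key-value pair RDD, you can exercise a couple of the pair-RDD operations on it (a small sketch, assuming the rdd and result defined above):

>>> result.lookup('a1')              # look up the value(s) stored under a key
[['b1', 'c1', 'd1', 'e1']]
>>> result.mapValues(len).collect()  # operate on the values only
[('a1', 4), ('a2', 4)]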

Explanation of lambda x: (x[0], list(x[1:])):

  1. x[0] takes the first element of the tuple and makes it the key (the first element of the output pair)
  2. x[1:] slices out every element except the first, and that slice becomes the second element of the output pair
  3. list(x[1:]) forces that slice to be a list, because slicing a tuple otherwise yields a tuple (see the short check after this list)
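
To illustrate point 3, the same slicing behaviour can be checked in a plain Python shell, without Spark (a minimal sketch using a made-up tuple t):

>>> t = ("a1", "b1", "c1", "d1", "e1")
>>> t[1:]           # slicing a tuple gives back a tuple
('b1', 'c1', 'd1', 'e1')
>>> list(t[1:])     # list() converts it into the list form we want
['b1', 'c1', 'd1', 'e1']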
answered Jan 10 '23 by B.Mr.W.