 

PySpark - Convert an RDD into a key value pair RDD, with the values being in a List

I have an RDD whose elements are tuples of the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to transform it into a key-value pair RDD, where the key is the first string of each tuple and the value is a list of the remaining strings, i.e. I want to turn it into the form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
asked Jan 08 '23 by nikos


1 Answer

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])

>>> result = rdd.map(lambda x: (x[0], list(x[1:])))

>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
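
As a quick check that result now behaves like a key-value pair RDD, you can exercise a couple of the pair-RDD operations on it (a small sketch, assuming the rdd and result defined above):

>>> result.lookup('a1')              # look up the value(s) stored under a key
[['b1', 'c1', 'd1', 'e1']]
>>> result.mapValues(len).collect()  # operate on the values only
[('a1', 4), ('a2', 4)]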

Explanation of lambda x: (x[0], list(x[1:])):

  1. x[0] takes the first element of the tuple and makes it the key (the first element of the output pair)
  2. x[1:] slices out every element except the first, and that slice becomes the second element of the output pair
  3. list(x[1:]) forces that slice to be a list, because slicing a tuple otherwise yields a tuple (see the short check after this list)
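
To illustrate point 3, the same slicing behaviour can be checked in a plain Python shell, without Spark (a minimal sketch using a made-up tuple t):

>>> t = ("a1", "b1", "c1", "d1", "e1")
>>> t[1:]           # slicing a tuple gives back a tuple
('b1', 'c1', 'd1', 'e1')
>>> list(t[1:])     # list() converts it into the list form we want
['b1', 'c1', 'd1', 'e1']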
answered Jan 10 '23 by B.Mr.W.