I have an RDD whose elements are tuples of the form:
[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...
I want to transform it into a key-value pair RDD, where the key is the first string and the value is a list of the remaining strings, i.e. I want to turn it into the form:
[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
Explanation of lambda x: (x[0], list(x[1:])):
x[0] makes the first element of the tuple the first element of the output (the key).
x[1:] puts all the elements except the first one into the second element of the output (the value).
list(x[1:]) forces the value to be a list, because slicing a tuple gives a tuple by default.
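To see why the list() call matters, here is a minimal plain-Python sketch of the same per-record logic, with no Spark involved. The function name to_key_value is hypothetical and used only for illustration; it does the same thing as the lambda in the answer.

```python
# Plain-Python sketch of the per-record transformation (no Spark required).
# to_key_value is a hypothetical name; it mirrors the lambda from the answer.
def to_key_value(record):
    """Return (first element, list of the remaining elements)."""
    return (record[0], list(record[1:]))

rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

# Slicing a tuple returns another tuple, which is why list() is applied.
tail = rows[0][1:]
print(type(tail).__name__)

# Apply the function to every record, as rdd.map would do per element.
pairs = [to_key_value(r) for r in rows]
print(pairs)
```

Assuming the RDD from the answer, the same function could presumably be passed straight to map, i.e. rdd.map(to_key_value), in place of the lambda.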