I have the following RDD in pyspark and I believe this should be really simple to do but haven't been able to figure it out:
information = [ (10, 'sentence number one'),
(17, 'longer sentence number two') ]
rdd = sc.parallelize(information)
I need to apply a transformation that turns that RDD into this:
[ ('sentence', 10),
('number', 10),
('one', 10),
('longer', 17),
('sentence', 17),
('number', 17),
('two', 17) ]
Basically, I want to expand each sentence into multiple rows, with the words as keys and the original number as the value.
I would like to avoid SQL.
Use flatMap:
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])
Example:
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
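If you want to sanity-check the per-record logic without a Spark cluster, the same comprehension can be exercised in plain Python (a local sketch only; `flatMap` applies the function to each record and flattens the resulting lists):

```python
# Plain-Python sketch of what the flatMap lambda does per record:
# each (id, sentence) pair expands into one (word, id) pair per word.
information = [(10, 'sentence number one'),
               (17, 'longer sentence number two')]

def expand(record):
    key, sentence = record
    return [(word, key) for word in sentence.split()]

# flatMap = map each record to a list, then flatten the lists together
result = [pair for record in information for pair in expand(record)]
print(result)
# [('sentence', 10), ('number', 10), ('one', 10),
#  ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
```

The key point is that the lambda returns a *list* of pairs for each input record; `flatMap` then concatenates those lists, which is exactly the expansion you describe.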