How to convert an RDD of lists into a single list in PySpark

I have an RDD whose elements are lists of tuples. It looks like this (millions of sublists omitted; only 3 shown here):

my_tuples = [[('a','b'),('a','c')], 
             [('b','a'),('b','f'),('b','g')], 
             [('zzsx','c'), ('zzsx','q'), ('zzsx','m'), ('zzsx','ay'), ('zzsx','bbt')]]

I want to flatten it into a single list like this:

my_list = [('a','b'),('a','c'), ('b','a'),('b','f'),('b','g'), 
           ('zzsx','c'), ('zzsx','q'), ('zzsx','m'), ('zzsx','ay'), ('zzsx','bbt')]

I can't use loops because my_tuples is an RDD and it is too large to loop over. I'm new to Spark, so any suggestions are appreciated. Thanks.

zoelearns asked Oct 22 '25 12:10



1 Answer

You can flatten it using flatMap:

rdd.flatMap(lambda l: l)

Since each element of your RDD is already a list, the function can simply return the element unchanged; flatMap then emits every tuple inside it as a separate record. Collecting the result gives:

[('a', 'b'),
 ('a', 'c'),
 ('b', 'a'),
 ('b', 'f'),
 ('b', 'g'),
 ('zzsx', 'c'),
 ('zzsx', 'q'),
 ('zzsx', 'm'),
 ('zzsx', 'ay'),
 ('zzsx', 'bbt')]
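
To see why an identity function is enough, here is a minimal plain-Python sketch that mimics what map and flatMap do to the records, with no Spark required. The helper names local_map and local_flat_map are hypothetical stand-ins for illustration, not part of the PySpark API, and the sample data is a shortened copy of the list in the question.

```python
from itertools import chain

def local_map(f, records):
    # One output record per input record (how RDD.map treats each element).
    return [f(r) for r in records]

def local_flat_map(f, records):
    # f must return an iterable; its items are emitted one by one
    # (how RDD.flatMap treats each element).
    return list(chain.from_iterable(f(r) for r in records))

# Shortened sample of the nested data from the question.
my_tuples = [[('a', 'b'), ('a', 'c')],
             [('b', 'a'), ('b', 'f'), ('b', 'g')],
             [('zzsx', 'c'), ('zzsx', 'q')]]

nested = local_map(lambda l: l, my_tuples)      # still 3 sublists
flat = local_flat_map(lambda l: l, my_tuples)   # 7 individual tuples

print(len(nested), len(flat))
print(flat)
```

On a real RDD, rdd.flatMap(lambda l: l) is lazy and returns another RDD, so nothing is looped over on the driver. Calling collect() on it pulls every tuple back to the driver, which may not fit in memory for millions of sublists; take(n) is a safer way to preview a few records.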
ernest_k answered Oct 25 '25 12:10



