
Return an RDD from takeOrdered, instead of a list

I'm using PySpark to do some data cleaning. A very common operation is to take a small subset of a file and export it for inspection:

(self.spark_context.textFile(old_filepath+filename)
    .takeOrdered(100) 
    .saveAsTextFile(new_filepath+filename))

My problem is that takeOrdered returns a list instead of an RDD, so saveAsTextFile doesn't work:

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Of course, I could implement my own file writer, or I could convert the list back into an RDD with parallelize. But I'm trying to be a Spark purist here.

Is there any way to return an RDD from takeOrdered or an equivalent function?

Abe asked Feb 09 '23 03:02


1 Answer

takeOrdered() is an action, not a transformation, so it returns its result to the driver as a list and can't return an RDD.
If ordering isn't necessary, the simplest alternative would be sample(), which is a transformation and takes a fraction of the data rather than an exact count.
If you do want ordering, you can try some combination of filter() and sortByKey() (or sortBy()) to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().
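
The two ordered options above could be sketched roughly as follows. This is only an illustration: the helper names top_n_via_parallelize and top_n_via_sort are made up for this example, sc is assumed to be an existing SparkContext, and rdd an RDD of comparable elements such as the lines returned by textFile.

```python
def top_n_via_parallelize(sc, rdd, n):
    """Take the n smallest elements on the driver, then wrap them in an RDD again.

    `sc` is assumed to be a live SparkContext (hypothetical parameter name).
    """
    smallest = rdd.takeOrdered(n)    # action: returns a plain Python list
    return sc.parallelize(smallest)  # list -> RDD, so saveAsTextFile() is available


def top_n_via_sort(rdd, n):
    """Stay with transformations: sort everything, index it, keep the first n."""
    return (rdd.sortBy(lambda x: x)                 # full distributed sort
               .zipWithIndex()                      # (element, position) pairs
               .filter(lambda pair: pair[1] < n)    # keep positions 0 .. n-1
               .keys())                             # drop the positions again


# Possible usage with the names from the question (not run here):
#   lines = sc.textFile(old_filepath + filename)
#   top_n_via_parallelize(sc, lines, 100).saveAsTextFile(new_filepath + filename)
```

The first version keeps the cheap takeOrdered() call and only moves n elements through the driver. The second never leaves the RDD API but sorts the whole dataset, so it is likely to cost more when only 100 rows are needed.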

yurib answered Feb 12 '23 01:02