I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered returns a list instead of an RDD, so saveAsTextFile doesn't work:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here.
Isn't there any way to return an RDD from takeOrdered or an equivalent function?
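For reference, the parallelize workaround I mentioned would look something like this (same self.spark_context and path variables as in the snippet above; the 100-row cutoff is just illustrative):

# takeOrdered() is an action: it pulls the first 100 rows back to the
# driver as a plain Python list.
rows = (self.spark_context.textFile(old_filepath + filename)
        .takeOrdered(100))

# parallelize() turns that list back into an RDD so it can be saved.
(self.spark_context.parallelize(rows)
 .saveAsTextFile(new_filepath + filename))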
takeOrdered() is an action, not a transformation, so it can't return an RDD.
If ordering isn't necessary, the simplest alternative would be sample().
If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().