I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered returns a list instead of an RDD, so saveAsTextFile doesn't work:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here.
Isn't there any way to return an RDD from takeOrdered or an equivalent function?
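For reference, the parallelize workaround I mentioned would look something like this (same self.spark_context and path variables as in the snippet above; the 100-row cutoff is just illustrative):

# takeOrdered() is an action: it pulls the first 100 rows back to the
# driver as a plain Python list.
rows = (self.spark_context.textFile(old_filepath + filename)
        .takeOrdered(100))

# parallelize() turns that list back into an RDD so it can be saved.
(self.spark_context.parallelize(rows)
 .saveAsTextFile(new_filepath + filename))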
takeOrdered() is an action, not a transformation, so it can't return an RDD.
If ordering isn't necessary, the simplest alternative would be sample().
If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().