I need to use the
(rdd.)partitionBy(npartitions, custom_partitioner)
method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data?
Note: this is a change (in 1.3.0) from 1.2.0.
Update from the answer from @dpangmao: the method is .rdd. I was interested to understand if (a) it were public and (b) what are the performance implications.
Well (a) is yes and (b) - well you can see here that there are significant perf implications: a new RDD must be created by invoking mapPartitions :
In dataframe.py (note the file name changed as well (was sql.py):
@property def rdd(self): """ Return the content of the :class:`DataFrame` as an :class:`RDD` of :class:`Row` s. """ if not hasattr(self, '_lazy_rdd'): jrdd = self._jdf.javaToPython() rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer())) schema = self.schema def applySchema(it): cls = _create_cls(schema) return itertools.imap(cls, it) self._lazy_rdd = rdd.mapPartitions(applySchema) return self._lazy_rdd
Convert Using createDataFrame Method This method can take an RDD and create a DataFrame from it. The createDataFrame is an overloaded method, and we can call the method by passing the RDD alone or with a schema. We can observe the column names are following a default sequence of names based on a default template.
You should call thisRDD. unpersist() to remove the cached data.
Try x = all_coord_iso_rdd. take(4) . Then print(type(x)) - you'll see that it's a list (of tuples). Then just convert it to string.
Use the method .rdd
like this:
rdd = df.rdd
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With