How to convert a DataFrame back to normal RDD in pyspark?

I need to use the

(rdd.)partitionBy(npartitions, custom_partitioner) 

method, which is not available on the DataFrame. All of the DataFrame methods return only DataFrame results. So how do I create an RDD from the DataFrame data?

Note: this is a change in Spark 1.3.0 from 1.2.0.

Update, following the answer from @dapangmao: the method is .rdd. I wanted to understand (a) whether it is public and (b) what the performance implications are.

The answer to (a) is yes. As for (b), the source below shows that there are significant performance implications: a new RDD must be created by invoking mapPartitions:

In dataframe.py (note that the file name changed as well; it was sql.py):

@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema

        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)

        self._lazy_rdd = rdd.mapPartitions(applySchema)

    return self._lazy_rdd
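To make the per-record cost concrete, here is a minimal pure-Python sketch of the pattern applySchema follows: a function that receives one partition's iterator and lazily wraps every record in a row class. It does not use Spark at all. The namedtuple is only an assumed stand-in for whatever `_create_cls(schema)` returns, and `make_apply_schema` and the field names are hypothetical. In Python 3 the built-in `map` is lazy, so it takes the place of `itertools.imap`.

```python
from collections import namedtuple


def make_apply_schema(field_names):
    """Build a function shaped like the applySchema closure quoted above."""

    def apply_schema(partition_iter):
        # Assumption: a namedtuple stands in for the class that
        # _create_cls(schema) would produce in PySpark.
        row_cls = namedtuple("Row", field_names)
        # Lazy: a Python object is constructed per record, but only
        # when the resulting iterator is actually consumed.
        return map(lambda values: row_cls(*values), partition_iter)

    return apply_schema


apply_schema = make_apply_schema(["name", "age"])
rows = apply_schema(iter([("alice", 30), ("bob", 25)]))
first = next(rows)
print(first.name, first.age)
```

Every record passes through Python code once, which is the extra work that the mapPartitions call adds on top of the underlying JVM data.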
Asked by WestCoastProjects on Mar 12 '15 at 01:03



1 Answer

Use the .rdd property, like this:

rdd = df.rdd 
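For the partitionBy use case in the question, a minimal sketch might look like the following. `partitionBy` operates on an RDD of (key, value) pairs, so the rows from `df.rdd` are first mapped to pairs. The helper name `to_partitioned_rdd`, the key column, and the byte-sum partitioner are all hypothetical, chosen only for illustration.

```python
def custom_partitioner(key):
    """Hypothetical partitioner: a deterministic integer derived from the key.

    PySpark takes the returned value modulo the number of partitions.
    """
    return sum(str(key).encode("utf-8"))


def to_partitioned_rdd(df, key_column, num_partitions):
    """Convert a DataFrame to an RDD of (key, Row) pairs with custom partitioning.

    Assumes `df` is a PySpark DataFrame that has a column named `key_column`.
    """
    pair_rdd = df.rdd.map(lambda row: (row[key_column], row))
    return pair_rdd.partitionBy(num_partitions, custom_partitioner)
```

With an existing DataFrame `df`, a call such as `to_partitioned_rdd(df, "id", 8)` would return the repartitioned pair RDD. The sketch uses a deterministic function rather than Python's built-in `hash` because string hashes can differ between Python processes unless `PYTHONHASHSEED` is fixed.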
Answered by dapangmao on Sep 28 '22 at 11:09