How to convert a DataFrame back to normal RDD in pyspark?

Tags: python, apache-spark, pyspark

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method, which is not available on the DataFrame. All of the DataFrame methods return only DataFrame results. So how do I create an RDD from the DataFrame data?

Note: this is a change (in 1.3.0) from 1.2.0.

Update from the answer by @dapangmao: the method is .rdd. I wanted to understand (a) whether it is public and (b) what the performance implications are.

Well, (a) is yes, and for (b) you can see here that there are significant performance implications: a new RDD must be created by invoking mapPartitions.

In dataframe.py (note that the file name changed as well; it was sql.py):

@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema

        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)

        self._lazy_rdd = rdd.mapPartitions(applySchema)

    return self._lazy_rdd
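To make the per-row cost more concrete, here is a pure-Python analogy of what applySchema does inside mapPartitions. It is only a sketch, not Spark's code: namedtuple stands in for the internal _create_cls helper, plain lists stand in for partitions, and the names apply_schema, convert_partition and the sample data are made up for illustration.

```python
from collections import namedtuple


def apply_schema(field_names):
    """Return a per-partition function that wraps raw tuples in a row class."""

    def convert_partition(iterator):
        # The row class is built once per partition, not once per row.
        row_cls = namedtuple("Row", field_names)
        # map() is lazy in Python 3, much like itertools.imap in Python 2,
        # so rows are converted one at a time as the partition is consumed.
        return map(row_cls._make, iterator)

    return convert_partition


# Two hypothetical "partitions" of raw tuples.
partitions = [[(1, "a"), (2, "b")], [(3, "c")]]
convert = apply_schema(["id", "label"])
converted = [list(convert(iter(p))) for p in partitions]
```

Every element still passes through a Python-level constructor call, which is where the extra cost of going from a DataFrame to an RDD of Row objects comes from.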


asked Mar 12 '15 01:03

WestCoastProjects

1 Answer

Use the method .rdd like this:

rdd = df.rdd
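As a minimal sketch of how this ties back to the partitionBy call in the question: partitionBy works on an RDD of (key, value) pairs, so the rows need a key first. The helper to_partitioned_rdd, the key_column parameter and the body of custom_partitioner are assumptions for illustration, and looking up a Row field by name assumes a reasonably recent PySpark. A CRC32-based partitioner is used here because it gives the same result in every Python worker process.

```python
import zlib


def custom_partitioner(key):
    """Map a key to a non-negative int; Spark takes it modulo the partition count."""
    return zlib.crc32(str(key).encode("utf-8"))


def to_partitioned_rdd(df, key_column, num_partitions):
    """Convert a DataFrame to a pair RDD and repartition it by key_column."""
    # df.rdd yields Row objects; keyBy turns each Row into a (key, Row) pair.
    pair_rdd = df.rdd.keyBy(lambda row: row[key_column])
    # partitionBy is only defined for pair RDDs.
    return pair_rdd.partitionBy(num_partitions, custom_partitioner)
```

Given a DataFrame df with a user_id column (a hypothetical name), to_partitioned_rdd(df, "user_id", 8) would return an RDD of (user_id, Row) pairs spread over 8 partitions.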

answered Sep 28 '22 11:09

dapangmao

