Can I convert a Pandas DataFrame to RDD?
if isinstance(data2, pd.DataFrame):
    print 'is DataFrame'
else:
    print 'is NOT DataFrame'

is DataFrame
Here is the output when trying to use .rdd:
dataRDD = data2.rdd
print dataRDD
AttributeError Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
2 print dataRDD
/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2148 return self[name]
2149 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150 (type(self).__name__, name))
2151
2152 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'rdd'
I would like to use a Pandas DataFrame rather than sqlContext to build, as I'm not sure whether all the functions in a Pandas DataFrame are available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?
The rdd attribute is used to convert a PySpark DataFrame to an RDD; several transformations are available on RDDs but not on DataFrames, so you often need to convert a PySpark DataFrame to an RDD. Since version 1.3, PySpark has exposed this as the .rdd property of the DataFrame.
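For example, a minimal sketch, assuming a running Spark 1.x sqlContext like the one in the question:

# Build a small Spark DataFrame, then drop down to its RDD
sparkDF = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ["k", "v"])

# .rdd exposes the underlying RDD of Row objects
kvRDD = sparkDF.rdd
print kvRDD.map(lambda row: row.k).collect()
## [u'foo', u'bar']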
Are RDDs being deprecated? The answer is a resounding no! What's more, you can seamlessly move between a DataFrame or Dataset and an RDD at will, by simple API method calls, and DataFrames and Datasets are built on top of RDDs.
Most data science workflows start with Pandas, an excellent library for a wide variety of transformations that handles many kinds of data, such as CSV or JSON. When your datasets start getting large, though, a move to Spark can increase speed and save time.
While the RDD offers low-level control over data, the Dataset and DataFrame APIs bring structure and high-level abstractions. Keep in mind that converting an RDD to a Dataset or DataFrame is easy, as the sketch below shows.
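Going the other way is just as simple. A minimal sketch, assuming the usual sc and sqlContext from a Spark 1.x session:

# Start from a plain RDD of tuples
rdd = sc.parallelize([("foo", 1), ("bar", 2)])

# createDataFrame accepts an RDD plus a list of column names as the schema
df = sqlContext.createDataFrame(rdd, ["k", "v"])
print df.count()
## 2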
Can I convert a Pandas DataFrame to RDD?

Well, yes, you can. A Pandas DataFrame
import pandas as pd

pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF
##      k  v
## 0  foo  1
## 1  bar  2
can be converted to a Spark DataFrame
spDF = sqlContext.createDataFrame(pdDF)
spDF.show()
## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+
and after that you can easily access the underlying RDD:
spDF.rdd.first()
## Row(k=u'foo', v=1)
Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure: it is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
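To illustrate the difference, random access and in-place mutation, both trivial in Pandas, have no Spark equivalent. A small sketch using the pdDF defined above:

# Label-based random access works on a local Pandas DataFrame
print pdDF.loc[1, "k"]
## bar

# So does in-place mutation; a Spark DataFrame supports neither
pdDF.loc[1, "v"] = 42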
A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be queried using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access, and it is immutable (there is no equivalent of Pandas' inplace); every transformation returns a new DataFrame.
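To make that concrete, here is a minimal sketch of both access styles, using the spDF built above (the temporary table name kv is my own choice):

from pyspark.sql.functions import col

# SQL-like API: each transformation returns a new, immutable DataFrame
spDF.where(col("k") == "foo").show()
## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## +---+-+

# Raw SQL (Spark 1.x): register the DataFrame as a temporary table first
spDF.registerTempTable("kv")
sqlContext.sql("SELECT k, SUM(v) AS total FROM kv GROUP BY k").show()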
If this is not possible, can anyone provide an example of using a Spark DataFrame?
Not really. It is far too broad a topic for SO. Spark has really good documentation, and Databricks provides some additional resources; those are good places to start.