Can I convert a Pandas DataFrame to RDD?
if isinstance(data2, pd.DataFrame):
    print 'is DataFrame'
else:
    print 'is NOT DataFrame'

is DataFrame
Here is the output when trying to use .rdd:
dataRDD = data2.rdd
print dataRDD
AttributeError Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
2 print dataRDD
/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2148 return self[name]
2149 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150 (type(self).__name__, name))
2151
2152 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'rdd'
I would like to use a Pandas DataFrame rather than sqlContext to build, as I'm not sure whether all the functions in a Pandas DataFrame are available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?
The rdd attribute is used to convert a PySpark DataFrame to an RDD; several transformations are available on RDDs but not on DataFrames, so you often need to convert a PySpark DataFrame to an RDD. Since version 1.3, PySpark has exposed this as the .rdd property of the DataFrame.
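For example, a minimal sketch, assuming a running Spark 1.x sqlContext like the one in the question:

# Build a small Spark DataFrame, then drop down to its RDD
sparkDF = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ["k", "v"])

# .rdd exposes the underlying RDD of Row objects
kvRDD = sparkDF.rdd
print kvRDD.map(lambda row: row.k).collect()
## [u'foo', u'bar']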
Are RDDs being deprecated? The answer is a resounding no! What's more, you can seamlessly move between a DataFrame or Dataset and an RDD at will, by simple API method calls, and DataFrames and Datasets are built on top of RDDs.
Most data science workflows start with Pandas, an excellent library for a wide variety of transformations that handles many kinds of data, such as CSV or JSON. When your datasets start getting large, though, a move to Spark can increase speed and save time.
While the RDD offers low-level control over data, the Dataset and DataFrame APIs bring structure and high-level abstractions. Keep in mind that converting an RDD to a Dataset or DataFrame is easy, as the sketch below shows.
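Going the other way is just as simple. A minimal sketch, assuming the usual sc and sqlContext from a Spark 1.x session:

# Start from a plain RDD of tuples
rdd = sc.parallelize([("foo", 1), ("bar", 2)])

# createDataFrame accepts an RDD plus a list of column names as the schema
df = sqlContext.createDataFrame(rdd, ["k", "v"])
print df.count()
## 2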
Can I convert a Pandas DataFrame to RDD?

Well, yes, you can. A Pandas DataFrame
import pandas as pd

pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF
##      k  v
## 0  foo  1
## 1  bar  2
can be converted to a Spark DataFrame
spDF = sqlContext.createDataFrame(pdDF)
spDF.show()
## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+
and after that you can easily access the underlying RDD:
spDF.rdd.first()
## Row(k=u'foo', v=1)
Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure: it is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
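To illustrate the difference, random access and in-place mutation, both trivial in Pandas, have no Spark equivalent. A small sketch using the pdDF defined above:

# Label-based random access works on a local Pandas DataFrame
print pdDF.loc[1, "k"]
## bar

# So does in-place mutation; a Spark DataFrame supports neither
pdDF.loc[1, "v"] = 42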
A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be queried using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access, and it is immutable (there is no equivalent of Pandas' inplace); every transformation returns a new DataFrame.
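To make that concrete, here is a minimal sketch of both access styles, using the spDF built above (the temporary table name kv is my own choice):

from pyspark.sql.functions import col

# SQL-like API: each transformation returns a new, immutable DataFrame
spDF.where(col("k") == "foo").show()
## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## +---+-+

# Raw SQL (Spark 1.x): register the DataFrame as a temporary table first
spDF.registerTempTable("kv")
sqlContext.sql("SELECT k, SUM(v) AS total FROM kv GROUP BY k").show()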
If this is not possible, can anyone provide an example of using a Spark DataFrame?
Not really. It is far too broad a topic for SO. Spark has really good documentation, and Databricks provides some additional resources; those are good places to start.