
Pandas DataFrame to RDD

Can I convert a Pandas DataFrame to RDD?

if isinstance(data2, pd.DataFrame):
    print 'is DataFrame'
else:
    print 'is NOT DataFrame'

is DataFrame

Here is the output when trying to use .rdd

dataRDD = data2.rdd
print dataRDD
AttributeError                            Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
      2 print dataRDD

/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   2148                 return self[name]
   2149             raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150                                  (type(self).__name__, name))
   2151 
   2152     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'rdd'

I would like to build with a Pandas DataFrame and not sqlContext, as I'm not sure whether all the functions available in a Pandas DataFrame are available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?

asked Aug 19 '15 08:08 by kraster


People also ask

Can we convert DataFrame to RDD?

PySpark's .rdd property is used to convert a PySpark DataFrame to an RDD; several transformations are available on RDDs but not on DataFrames, so you sometimes need to convert a PySpark DataFrame to an RDD. This property has been available since PySpark 1.3.
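
A minimal sketch of that property, assuming an active SQLContext named sqlContext (as in the answer below):

# hypothetical example: build a small Spark DataFrame and grab its RDD
spark_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ["k", "v"])
rdd = spark_df.rdd            # .rdd exposes the underlying RDD of Row objects
print(rdd.take(2))            # [Row(k=u'foo', v=1), Row(k=u'bar', v=2)]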

Is RDD obsolete?

Are they being deprecated? The answer is a resounding no. What's more, you can seamlessly move between a DataFrame or Dataset and RDDs at will, by simple API method calls, and DataFrames and Datasets are built on top of RDDs.

When should I switch from pandas to Spark?

When your datasets start getting large, a move to Spark can increase speed and save time. Most data science workflows start with Pandas, an excellent library that lets you perform a variety of transformations and can handle different kinds of data, such as CSV or JSON files.

Is RDD better than DataFrames?

While RDD offers low-level control over data, Dataset and DataFrame APIs bring structure and high-level abstractions. Keep in mind that transformations from an RDD to a Dataset or DataFrame are easy to execute.
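
As a rough sketch of that round trip, assuming a SparkContext sc and a SQLContext sqlContext are already set up:

# hypothetical example: RDD of tuples -> DataFrame, and back to an RDD
rdd = sc.parallelize([("foo", 1), ("bar", 2)])
df = sqlContext.createDataFrame(rdd, ["k", "v"])   # RDD -> DataFrame
rows = df.rdd                                      # DataFrame -> RDD of Rows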


1 Answer

Can I convert a Pandas DataFrame to RDD?

Well, yes, you can. Pandas DataFrames

pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF

##      k  v
## 0  foo  1
## 1  bar  2

can be converted to Spark DataFrames

spDF = sqlContext.createDataFrame(pdDF)
spDF.show()

## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+

and after that you can easily access the underlying RDD

spDF.rdd.first()

## Row(k=u'foo', v=1)

Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure. It is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
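
As a small illustration of that difference, reusing the pdDF from above (just a sketch of ordinary Pandas usage, nothing Spark-specific):

print(pdDF.loc[1, "v"])    # random access by label: prints 2
pdDF.loc[1, "v"] = 42      # in-place update; a Spark DataFrame has no equivalent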

A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be accessed using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access, and it is immutable (no equivalent of Pandas' inplace). Every transformation returns a new DataFrame.
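
For example, both access styles on the spDF defined above might look like this (a sketch; registerTempTable is the Spark 1.x way to expose a DataFrame to raw SQL):

from pyspark.sql.functions import col, sum as sum_

spDF.registerTempTable("spDF")                      # make it visible to raw SQL
sqlContext.sql("SELECT k, SUM(v) AS total FROM spDF GROUP BY k").show()

# the equivalent SQL-like API; each call returns a new DataFrame
spDF.groupBy(col("k")).agg(sum_(col("v"))).show()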

If this is not possible, can anyone provide an example of using a Spark DataFrame?

Not really. It is far too broad a topic for SO. Spark has really good documentation, and Databricks provides some additional resources. For starters, you can check these:

  • Introducing DataFrames in Spark for Large Scale Data Science
  • Spark SQL and DataFrame Guide

answered Oct 06 '22 18:10 by zero323