I created a Spark DataFrame in a Python paragraph in Zeppelin:
sqlCtx = SQLContext(sc)
spDf = sqlCtx.createDataFrame(df)
where df is a pandas DataFrame:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
What I want to do is move spDf from one Python paragraph to another Scala paragraph. A reasonable way to do this looks to be using z.put:
z.put("spDf", spDf)
and I got this error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Any suggestion to fix the error? Or any other suggestion on how to move spDf?
Scala is faster than Python because it is a statically typed language. If performance is a requirement, Scala is a good bet. Spark itself is written in Scala, which makes Scala the native way to write Spark jobs.
The Scala programming language can be around 10 times faster than Python for data analysis and processing because it runs on the JVM. Performance is mediocre when Python code is used only to make calls to Spark libraries, but if a lot of the processing happens in Python itself, the code becomes much slower than the equivalent Scala code.
This thread has a dated performance comparison. “Regular” Scala code can run 10-20x faster than “regular” Python code, but PySpark isn't executed like regular Python code, so that comparison isn't relevant here. PySpark calls are converted to Spark SQL operations and executed on a JVM cluster.
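You can see this for yourself with explain(), the standard DataFrame method for printing the query plan (a minimal sketch; the filter is just an arbitrary operation for illustration):
%pyspark
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
# Even though the plan is assembled from Python, it is executed
# entirely on the JVM; explain() prints that JVM-side physical plan.
df.filter(df.k > 1).explain()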
You can put the internal Java object, not the Python wrapper:
%pyspark
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
z.put("df", df._jdf)
and then make sure you use the correct type:
val df = z.get("df").asInstanceOf[org.apache.spark.sql.DataFrame]
// df: org.apache.spark.sql.DataFrame = [k: bigint, v: string]
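This works because a PySpark DataFrame is a thin Python wrapper around a JVM object: _jdf is the underlying py4j handle, which is what z.put knows how to pass along (py4j objects carry the _get_object_id attribute from the error message; the Python wrapper does not). A quick check in the Python paragraph (a minimal sketch; the printed class names may vary slightly by version):
%pyspark
# The Python-side object is not a JVM object...
print(type(df))       # <class 'pyspark.sql.dataframe.DataFrame'>
# ...but its _jdf attribute is the py4j reference to the JVM DataFrame,
# which is why z.put("df", df._jdf) succeeds where z.put("df", df) fails.
print(type(df._jdf))  # <class 'py4j.java_gateway.JavaObject'>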
but it is better to register a temporary table:
%pyspark
# registerTempTable in Spark 1.x
df.createTempView("df")
and use SQLContext.table to read it:
// sqlContext.table in Spark 1.x
val df = spark.table("df")
df: org.apache.spark.sql.DataFrame = [k: bigint, v: string]
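The temporary view above is scoped to the SparkSession, which the Python and Scala interpreters share in a typical Zeppelin setup. If your interpreters end up with separate sessions, a global temporary view works across sessions within the same application (a sketch, assuming Spark 2.1 or later, where createGlobalTempView was introduced):
%pyspark
# Global temp views live in the reserved global_temp database and are
# visible to every SparkSession in the same Spark application.
df.createGlobalTempView("df")
and read it from the Scala paragraph with spark.table("global_temp.df").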
To convert in the opposite direction, see Zeppelin: Scala Dataframe to python.