
Zeppelin: Scala Dataframe to python

If I have a Scala paragraph with a DataFrame, can I share it with Python and use it there? (As I understand it, pyspark uses py4j.)

I tried this:

Scala paragraph:

x.printSchema
z.put("xtable", x )

Python paragraph:

%pyspark

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

the_data = z.get("xtable")

print the_data

sns.set()
g = sns.PairGrid(data=the_data,
                 x_vars=dependent_var,
                 y_vars=sensor_measure_columns_names +  operational_settings_columns_names,
                 hue="UnitNumber", size=3, aspect=2.5)
g = g.map(plt.plot, alpha=0.5)
g = g.set(xlim=(300,0))
g = g.add_legend()

Error:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 222, in <module>
    eval(compiledCode)
  File "<string>", line 15, in <module>
  File "/usr/local/lib/python2.7/dist-packages/seaborn/axisgrid.py", line 1223, in __init__
    hue_names = utils.categorical_order(data[hue], hue_order)
TypeError: 'JavaObject' object has no attribute '__getitem__'
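
The traceback shows that z.get handed back a py4j JavaObject proxy rather than a pandas DataFrame, so seaborn's data[hue] lookup fails. A minimal sketch of a guard that reports this before any plotting call; require_subscriptable and FakeJavaObject are hypothetical names, and the stand-in class only mimics an object whose type defines no __getitem__.

```python
class FakeJavaObject(object):
    """Hypothetical stand-in for a py4j proxy: its type defines no __getitem__."""


def require_subscriptable(data):
    """Return data unchanged, or raise a clear error if data[key] cannot work."""
    # Special methods are looked up on the type, which is how data[key] resolves
    # them, so the check is made on type(data) rather than on the instance.
    if not hasattr(type(data), "__getitem__"):
        raise TypeError(
            "expected a pandas DataFrame, got %s; convert it first, "
            "for example with toPandas()" % type(data).__name__
        )
    return data


# A plain dict supports data[key], so it passes through unchanged.
columns = require_subscriptable({"UnitNumber": [1, 2]})

# The stand-in proxy does not, so the guard raises before any plotting call.
try:
    require_subscriptable(FakeJavaObject())
    guard_message = None
except TypeError as err:
    guard_message = str(err)
```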

Solution:

%pyspark

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import StringIO
def show(p):
    img = StringIO.StringIO()
    p.savefig(img, format='svg')
    img.seek(0)
    print "%html <div style='width:600px'>" + img.buf + "</div>"

df = sqlContext.table("fd")
df.printSchema()
pdf = df.toPandas()

g = sns.pairplot(data=pdf,
                 x_vars=["setting1","setting2"],
                 y_vars=["s4", "s3", 
                         "s9", "s8", 
                         "s13", "s6"],
                 hue="id", aspect=2)
show(g)   
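
The show helper above relies on the Python 2 StringIO module and print statement. A minimal sketch of a Python 3 equivalent, assuming the plot object exposes savefig as in the code above; render_svg_html is a hypothetical name, and _StubPlot is a stand-in used only so the sketch runs without matplotlib.

```python
import io


def render_svg_html(plot, width_px=600):
    """Serialize a plot to SVG and wrap it in Zeppelin's %html display markup."""
    buf = io.StringIO()
    plot.savefig(buf, format="svg")
    return "%html <div style='width:{}px'>{}</div>".format(width_px, buf.getvalue())


class _StubPlot(object):
    """Hypothetical stand-in that writes a tiny SVG instead of a real figure."""

    def savefig(self, target, format=None):
        target.write("<svg></svg>")


html = render_svg_html(_StubPlot())
print(html)
```

In a %pyspark paragraph, print(render_svg_html(g)) would then play the role of show(g).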

[Image: cluster visualisation]

asked Jan 23 '26 09:01 by oluies


1 Answer

You can register the DataFrame as a temporary table in Scala:

// registerTempTable in Spark 1.x
df.createTempView("df")

and read it in Python with SQLContext.table:

df = sqlContext.table("df")

If you really want to use put / get, you'll have to build the Python DataFrame from scratch.

Scala paragraph:

z.put("df", df: org.apache.spark.sql.DataFrame)

Python paragraph:

from pyspark.sql import DataFrame

df = DataFrame(z.get("df"), sqlContext)

To plot with matplotlib, you'll have to convert the DataFrame to a local Python object with either collect or toPandas:

pdf = df.toPandas()

Please note that this fetches all of the data to the driver.
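
Because toPandas collects every row on the driver, it can help to cap the row count first. A minimal sketch, assuming a PySpark DataFrame (which provides limit and toPandas); to_local_pandas and the max_rows default are hypothetical, and _StubFrame only stands in for a Spark DataFrame so the sketch runs without a cluster.

```python
def to_local_pandas(df, max_rows=10000):
    """Cap the number of rows, then collect the result on the driver."""
    return df.limit(max_rows).toPandas()


class _StubFrame(object):
    """Hypothetical stand-in mimicking DataFrame.limit and DataFrame.toPandas."""

    def __init__(self, rows):
        self.rows = rows

    def limit(self, n):
        # Keep only the first n rows, as a new frame.
        return _StubFrame(self.rows[:n])

    def toPandas(self):
        # A real DataFrame returns a pandas.DataFrame; the stub returns a list.
        return list(self.rows)


local_rows = to_local_pandas(_StubFrame(range(5)), max_rows=3)
```

With a real DataFrame the call would be pdf = to_local_pandas(df); limit simply takes the first rows, so it is not a random sample.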

See also: moving Spark DataFrame from Python to Scala within Zeppelin

answered Jan 26 '26 01:01 by zero323


