Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use matplotlib to plot pyspark sql results

I am new to pyspark. I want to plot the result using matplotlib, but not sure which function to use. I searched for a way to convert sql result to pandas and then use plot.

like image 838
HasanDange Avatar asked Jul 10 '17 03:07

HasanDange


People also ask

Can we use matplotlib in PySpark?

Practical Data Science using PythonSet the figure size and adjust the padding between and around the subplots. Get the instance that is the main Entry Point for Spark functionality. Get the instance of a variant of Spark SQL that integrates with the data stored in Hive. Make a list of records as a tuple.

How do you visualize data in PySpark?

You can visualize a Spark dataframe in Jupyter notebooks by using the display(<dataframe-name>) function. The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns. By default, the dataframe is visualized as a table.

Can you use matplotlib on DataFrame?

Matplotlib is an amazing python library which can be used to plot pandas dataframe.


2 Answers

I have found the solution for this. I converted sql dataframe to pandas dataframe and then I was able to plot the graphs. below is the sample code.from

pyspark.sql import Row
from pyspark.sql import HiveContext
import pyspark
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 
sc = pyspark.SparkContext()
sqlContext = HiveContext(sc)
test_list = [(1, 'hasan'),(2, 'nana'),(3, 'dad'),(4, 'mon')]
rdd = sc.parallelize(test_list)
people = rdd.map(lambda x: Row(id=int(x[0]), name=x[1]))
schemaPeople = sqlContext.createDataFrame(people)
# Register it as a temp table
sqlContext.registerDataFrameAsTable(schemaPeople, "test_table")
df1=sqlContext.sql("Select * from test_table")
pdf1=df1.toPandas()
pdf1.plot(kind='barh',x='name',y='id',colormap='winter_r')
like image 85
HasanDange Avatar answered Sep 18 '22 21:09

HasanDange


For small data, you can use .select() and .collect() on the pyspark DataFrame. collect will give a python list of pyspark.sql.types.Row, which can be indexed. From there you can plot using matplotlib without Pandas, however using Pandas dataframes with df.toPandas() is probably easier.

like image 38
qwr Avatar answered Sep 19 '22 21:09

qwr