Pyspark: show histogram of a data frame column

With a pandas DataFrame, I use the following code to plot a histogram of a column:

my_df.hist(column = 'field_1') 

Is there something that can achieve the same goal for a PySpark DataFrame? (I am using a Jupyter Notebook.) Thanks!

Edamame asked Aug 25 '16 20:08



1 Answer

Unfortunately, I don't think there's a clean plot() or hist() function in the PySpark DataFrame API, but I'm hoping that things will eventually go in that direction.

For the time being, you could compute the histogram in Spark, and plot the computed histogram as a bar chart. Example:

import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Creating a pandas dataframe from Sample Data
df_pd = pd.read_csv(file_name)

# `sc` is assumed to be the SparkContext already provided by the notebook
sql_context = sparksql.SQLContext(sc)

# Creating a Spark DataFrame from a pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)

df_spark.show(5)

This is what the data looks like:

Out[]:
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|    0|380|3.61|   3|
|    1|660|3.67|   3|
|    1|800| 4.0|   1|
|    1|640|3.19|   4|
|    0|520|2.93|   4|
+-----+---+----+----+
only showing top 5 rows

# This is what we want
df_pd.hist('gre');

Histogram plotted using df_pd.hist()

# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD api
# histogram(11) returns a tuple: (12 bucket boundaries, 11 counts)
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the Computed Histogram into a Pandas Dataframe for plotting
# zip pairs each count with the left edge of its bucket
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');

Histogram computed by using RDD.histogram()
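
One detail worth knowing: RDD.histogram(n) returns a pair of lists, n + 1 bucket boundaries and n counts, so zipping them labels each bar with only the left edge of its bucket. Every bucket is open on the right except the last one, which is closed. If you would rather label each bar with its full range, something like the sketch below might help. The helper name `label_bins` is my own and not part of Spark, and the sample values are made up purely to show the shape of the data.

```python
def label_bins(buckets, counts):
    """Pair each count with a readable 'low-high' label for its bucket.

    `buckets` and `counts` are expected in the shape RDD.histogram returns:
    one more boundary than there are counts.
    """
    if len(buckets) != len(counts) + 1:
        raise ValueError("expected exactly one more boundary than counts")
    rows = []
    for i, count in enumerate(counts):
        low, high = buckets[i], buckets[i + 1]
        # :g drops trailing zeros, so 25.0 is shown as 25
        rows.append((f"{low:g}-{high:g}", count))
    return rows


# Illustrative values only, shaped like the result of histogram(2)
example_histogram = ([0, 25, 50], [25, 26])
rows = label_bins(*example_histogram)
print(rows)
```

The resulting rows could then be passed to pd.DataFrame(rows, columns=['bin', 'frequency']) in place of list(zip(*gre_histogram)) in the plotting code above.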

Shivam Gaur answered Sep 16 '22 14:09