In a pandas DataFrame, I am using the following code to plot a histogram of a column:
my_df.hist(column = 'field_1')
Is there something that can achieve the same goal with a PySpark DataFrame? (I am in a Jupyter Notebook.) Thanks!
A histogram in PySpark is built by binning a column's values and counting how many rows fall into each bin, typically with aggregation functions. It is a visualization technique used to show the distribution of a variable, and once the bins have been computed the plot itself is straightforward.
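For instance, a minimal sketch of binning with plain DataFrame aggregations (here `df` is an assumed Spark DataFrame with a numeric column 'field_1', and the bin width is purely illustrative):

from pyspark.sql import functions as F

# Assumed: `df` is an existing Spark DataFrame with a numeric column 'field_1'.
bin_width = 10  # illustrative bin width; tune to your data

binned = (
    df.withColumn("bin", F.floor(F.col("field_1") / bin_width) * bin_width)
      .groupBy("bin")
      .count()
      .orderBy("bin")
)
binned.show()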
Unfortunately I don't think that there's a clean plot() or hist() function in the PySpark DataFrames API, but I'm hoping that things will eventually go in that direction. For the time being, you could compute the histogram in Spark, and plot the computed histogram as a bar chart. Example:
import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Creating a pandas dataframe from the sample data
df_pd = pd.read_csv(file_name)

sql_context = sparksql.SQLContext(sc)

# Creating a Spark DataFrame from the pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)
df_spark.show(5)
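Note that SQLContext is largely superseded by SparkSession on Spark 2.0 and later; a roughly equivalent setup might look like the sketch below (the appName is illustrative, and many notebooks already provide a ready-made `spark` session):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; in many notebooks `spark` already exists.
spark = SparkSession.builder.appName("histogram-example").getOrCreate()

df_spark = spark.createDataFrame(df_pd)
df_spark.show(5)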
This is what the data looks like:
Out[]:
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|    0|380|3.61|   3|
|    1|660|3.67|   3|
|    1|800| 4.0|   1|
|    1|640|3.19|   4|
|    0|520|2.93|   4|
+-----+---+----+----+
only showing top 5 rows

# This is what we want
df_pd.hist('gre');
Histogram plotted using df_pd.hist()
# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD API
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the computed histogram into a pandas DataFrame for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index('bin').plot(kind='bar');
Histogram computed using RDD.histogram()
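If the column comfortably fits in driver memory, another pragmatic option is to pull just that one column back to pandas and reuse pandas' own .hist(); a minimal sketch (the bin count is illustrative, and for large data you would likely want to .sample(fraction=...) before calling toPandas()):

# Pull only the 'gre' column back to the driver and plot with pandas
df_spark.select('gre').toPandas().hist(column='gre', bins=11);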