In a pandas DataFrame, I am using the following code to plot a histogram of a column:
my_df.hist(column = 'field_1')
Is there something that can achieve the same goal with a PySpark DataFrame? (I am in a Jupyter Notebook.) Thanks!
A histogram in PySpark is built by binning a column's values and counting how many rows fall into each bin, typically with aggregation functions. It is a visualization technique used to show the distribution of a variable, and once the bins have been computed the plot itself is straightforward.
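For instance, a minimal sketch of binning with plain DataFrame aggregations (here `df` is an assumed Spark DataFrame with a numeric column 'field_1', and the bin width is purely illustrative):

from pyspark.sql import functions as F

# Assumed: `df` is an existing Spark DataFrame with a numeric column 'field_1'.
bin_width = 10  # illustrative bin width; tune to your data

binned = (
    df.withColumn("bin", F.floor(F.col("field_1") / bin_width) * bin_width)
      .groupBy("bin")
      .count()
      .orderBy("bin")
)
binned.show()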
Unfortunately I don't think that there's a clean plot() or hist() function in the PySpark DataFrames API, but I'm hoping that things will eventually go in that direction. For the time being, you could compute the histogram in Spark, and plot the computed histogram as a bar chart. Example:
import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Creating a pandas dataframe from the sample data
df_pd = pd.read_csv(file_name)

sql_context = sparksql.SQLContext(sc)

# Creating a Spark DataFrame from the pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)
df_spark.show(5)
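Note that SQLContext is largely superseded by SparkSession on Spark 2.0 and later; a roughly equivalent setup might look like the sketch below (the appName is illustrative, and many notebooks already provide a ready-made `spark` session):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; in many notebooks `spark` already exists.
spark = SparkSession.builder.appName("histogram-example").getOrCreate()

df_spark = spark.createDataFrame(df_pd)
df_spark.show(5)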
This is what the data looks like:
Out[]:
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|    0|380|3.61|   3|
|    1|660|3.67|   3|
|    1|800| 4.0|   1|
|    1|640|3.19|   4|
|    0|520|2.93|   4|
+-----+---+----+----+
only showing top 5 rows

# This is what we want
df_pd.hist('gre');
Histogram plotted using df_pd.hist()
# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD API
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the computed histogram into a pandas DataFrame for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index('bin').plot(kind='bar');
Histogram computed using RDD.histogram()
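If the column comfortably fits in driver memory, another pragmatic option is to pull just that one column back to pandas and reuse pandas' own .hist(); a minimal sketch (the bin count is illustrative, and for large data you would likely want to .sample(fraction=...) before calling toPandas()):

# Pull only the 'gre' column back to the driver and plot with pandas
df_spark.select('gre').toPandas().hist(column='gre', bins=11);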