I am trying to draw histograms for all of the columns in my data frame.
I imported pyspark and matplotlib. df is my data frame variable, and plt is the matplotlib.pyplot module.
I was able to plot a histogram for an individual column, like this:
bins, counts = df.select('ColumnName').rdd.flatMap(lambda x: x).histogram(20)
plt.hist(bins[:-1], bins=bins, weights=counts)
But when I try to plot it for all variables I am having issues. Here is the for loop I have so far:
for x in range(0, len(df.columns)):
    bins, counts = df.select(x).rdd.flatMap(lambda x: x).histogram(20)
    plt.hist(bins[:-1], bins=bins, weights=counts)
How would I do it? Thanks in advance.
To create histograms of all columns in an R data frame, we can use the hist.data.frame function of the Hmisc package. For example, if we have a data frame df that contains five columns, then the histograms for all five columns can be created with a single line of code.
pandas is tightly integrated with Matplotlib. You can plot data directly from a DataFrame using the plot() method. To plot multiple columns on a single set of axes, pass the list of column names to the y argument of plot().
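As a small illustration of the y argument, here is a sketch on made-up data. The DataFrame pdf and its column names (day, temp, humidity) are hypothetical and only there to make the example self-contained.

```python
import pandas as pd

# Hypothetical sample data; the column names are invented for illustration.
pdf = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "temp": [20.5, 22.1, 19.8, 23.4],
    "humidity": [30.0, 35.0, 33.0, 31.0],
})

# Passing a list to y draws one line per listed column on the same axes.
ax = pdf.plot(x="day", y=["temp", "humidity"])

# Each plotted column becomes one line on the axes.
n_lines = len(ax.get_lines())
```

Call plt.show() (from matplotlib.pyplot) or ax.figure.savefig(...) afterwards to actually see the figure.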
To plot histograms using pandas, call .hist() on the DataFrame. This draws one histogram for each numeric column in the pandas DataFrame.
As an alternative to the for loop approach, I think you can try this:
df.hist(bins=30, figsize=(15, 10))
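This will plot a histogram for each numerical attribute in the df DataFrame. Here, the bins and figsize arguments are just for customizing the output. Note that hist() is a pandas DataFrame method, so a PySpark DataFrame has to be converted first, for example with df.toPandas().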
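If you would rather keep the aggregation in Spark, the loop from the question can be made to work by passing column names to select() instead of integer positions, and by giving each column its own axes. The following is only a sketch: plot_all_histograms, numeric_cols, n_bins and n_plot_cols are names I made up, and it assumes every column you pass in is numeric.

```python
import matplotlib.pyplot as plt


def plot_all_histograms(df, numeric_cols, n_bins=20, n_plot_cols=3):
    """Draw one histogram per listed column of a PySpark DataFrame.

    Hypothetical helper: df is expected to be a PySpark DataFrame and
    numeric_cols a list of names of numeric columns in it.
    """
    if not numeric_cols:
        raise ValueError("numeric_cols must contain at least one column name")

    n_rows = -(-len(numeric_cols) // n_plot_cols)  # ceiling division
    fig, axes = plt.subplots(
        n_rows, n_plot_cols, squeeze=False,
        figsize=(5 * n_plot_cols, 4 * n_rows),
    )
    flat_axes = axes.ravel()

    for ax, name in zip(flat_axes, numeric_cols):
        # select() takes a column name, not an integer position.
        bins, counts = (
            df.select(name).rdd
            .flatMap(lambda row: row)
            .histogram(n_bins)
        )
        # Same trick as the single-column version: the bin edges and
        # counts were already computed by Spark, so pass them as weights.
        ax.hist(bins[:-1], bins=bins, weights=counts)
        ax.set_title(name)

    # Hide any grid cells that did not receive a histogram.
    for ax in flat_axes[len(numeric_cols):]:
        ax.set_visible(False)

    fig.tight_layout()
    return fig
```

One way to build numeric_cols is [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)], with NumericType imported from pyspark.sql.types. Then call plot_all_histograms(df, numeric_cols) followed by plt.show().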