Plotting Histogram for all columns in a Data Frame

I am trying to draw histograms for all of the columns in my data frame. I have imported pyspark and matplotlib; df is my data frame and plt is matplotlib.pyplot.

I was able to plot a histogram for an individual column like this:

bins, counts = df.select('ColumnName').rdd.flatMap(lambda x: x).histogram(20)
plt.hist(bins[:-1], bins=bins, weights=counts)

But when I try to plot histograms for all columns, I run into issues. Here is the for loop I have so far:

for x in range(0, len(df.columns)):
    bins, counts = df.select(x).rdd.flatMap(lambda x: x).histogram(20)
    plt.hist(bins[:-1], bins=bins, weights=counts)

How would I do it? Thanks in advance.

Ram asked Apr 11 '18 16:04



1 Answer

As an alternative to the for loop approach, I think you can try this:

df.hist(bins=30, figsize=(15, 10))

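This will plot a histogram for each numerical column in df. Note that hist() is a pandas DataFrame method, so a PySpark DataFrame would first need to be converted, for example with df.toPandas(), which collects all rows onto the driver and is therefore only practical when the data fits in memory. The bins and figsize arguments just customize the output.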
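If df is a PySpark DataFrame and should stay in Spark, the per-column approach from the question can probably be kept by looping over column names rather than integer indexes, since passing an integer to df.select is likely what fails in the question's loop. The following is only a minimal sketch under some assumptions: the helper names plot_prebinned_histograms and spark_histograms are hypothetical, only numeric columns are plotted, and each column is assumed to contain at least one non-null value.

```python
import math

import matplotlib.pyplot as plt


def plot_prebinned_histograms(histograms, ncols=3):
    """Draw one subplot per column from pre-computed histograms.

    `histograms` maps a column name to the (bin_edges, counts) pair
    that RDD.histogram returns. Hypothetical helper for illustration.
    """
    nrows = max(1, math.ceil(len(histograms) / ncols))
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(5 * ncols, 4 * nrows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, (name, (edges, counts)) in zip(flat_axes, histograms.items()):
        # Same trick as in the question: one dummy sample per bin,
        # weighted by that bin's count.
        ax.hist(edges[:-1], bins=edges, weights=counts)
        ax.set_title(name)
    # Hide any grid cells left over after the last column.
    for ax in flat_axes[len(histograms):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


def spark_histograms(df, n_bins=20):
    """Compute (bin_edges, counts) for every numeric column of a Spark DataFrame."""
    # Imported here so the plotting helper above works without Spark installed.
    from pyspark.sql.types import NumericType

    numeric_cols = [
        f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)
    ]
    return {
        # Select by column NAME; `row` avoids shadowing the loop variable.
        name: df.select(name).rdd.flatMap(lambda row: row).histogram(n_bins)
        for name in numeric_cols
    }
```

With a Spark DataFrame df, calling plot_prebinned_histograms(spark_histograms(df)) followed by plt.show() should give one panel per numeric column. Each column triggers its own Spark job, so this may be slow on wide tables.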

Farid answered Oct 15 '22 23:10