 

Making histogram with Spark DataFrame column

I am trying to make a histogram with a column from a dataframe which looks like

DataFrame[C0: int, C1: int, ...]

If I were to make a histogram with the column C1, what should I do?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

Neither works, because of a mismatch in data types.

asked Mar 16 '16 by user2857014

5 Answers

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt

# Compute 20 histogram buckets for the 'C1' column on the executors
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# plt.hist expects raw samples, so pass the bucket edges as the "samples"
# and weight each one by its count -- a bit awkward, but it plots the
# precomputed bars correctly
plt.hist(bins[:-1], bins=bins, weights=counts)
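To see why the plotting call uses bins[:-1], it helps to know the shape that histogram(n) returns: n+1 bucket edges and n counts. Here is a pure-Python sketch of that computation for evenly spaced buckets (an assumption: this mirrors the PySpark semantics on a plain list, with the last bucket closed on the right).

```python
# Pure-Python sketch of what RDD.histogram(n) computes for evenly
# spaced buckets (assumption: mirrors PySpark's semantics on a list).
def histogram(values, n):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    edges = [lo + i * width for i in range(n)] + [hi]
    counts = [0] * n
    for v in values:
        # The last bucket is closed on the right, like RDD.histogram
        i = min(int((v - lo) / width), n - 1)
        counts[i] += 1
    return edges, counts

bins, counts = histogram([1, 2, 2, 3, 3, 3, 7, 8, 9, 9], 5)
# bins has n + 1 edges, counts has n entries -- hence bins[:-1] above
```

Since there is one more edge than there are counts, dropping the last edge gives exactly one "sample" per bar for plt.hist to weight.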
answered Oct 05 '22 by Brian Wylie


What worked for me is

df.groupBy("C1").count().rdd.values().histogram(10)  # histogram() requires a bucket count

I had to convert to an RDD because the histogram method lives in the pyspark.RDD class, not in the pyspark.sql module.
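The groupBy/count step is just a value-to-frequency table; in plain Python the same shape falls out of collections.Counter. This Spark-free sketch shows what .rdd.values() would then feed into histogram (the sample data is made up for illustration):

```python
from collections import Counter

# Stand-in for the C1 column (hypothetical sample data)
rows = [1, 1, 2, 3, 3, 3]

# Equivalent of df.groupBy("C1").count(): value -> frequency
freq = Counter(rows)

# Equivalent of .rdd.values(): just the frequencies, which is what
# gets histogrammed -- note this is a histogram of the *counts*,
# not of the C1 values themselves
values = list(freq.values())
```

Worth noting: this approach bins the per-value counts, not the raw C1 values, so it answers "how common are frequencies of each size" rather than "how are C1 values distributed".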

answered Oct 05 '22 by lanenok


You can use histogram_numeric Hive UDAF:

import random

from pyspark.sql import HiveContext  # Spark 1.x; on 2.x+, use a SparkSession with Hive support

random.seed(323)

sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method on the underlying RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])
answered Oct 05 '22 by zero323


Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:

df.withColumn("bins", floor(df.C1 / 100)).groupBy("bins").count()

(floor comes from pyspark.sql.functions; without it, df.C1/100 produces fractional values and you get one group per distinct value rather than 10 bins.) If your binning is more complex you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or through some other method).

answered Oct 05 '22 by Assaf Mendelson


If you want to plot the histogram, you can use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame, you can use:

from pyspark_dist_explore import pandas_histogram

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
answered Oct 05 '22 by Chris van den Berg