I am trying to make a histogram with a column from a dataframe which looks like
DataFrame[C0: int, C1: int, ...]
If I were to make a histogram with the column C1, what should I do?
Some things I have tried are
df.groupBy("C1").count().histogram()
df.C1.countByValue()
These do not work because of a mismatch in data types.
The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.
import matplotlib.pyplot as plt
# histogram(20) computes 21 bucket edges and 20 per-bucket counts for 'C1'
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward, but I believe it is the correct way to plot
# pre-binned data: left bucket edges as the "data", counts as weights
plt.hist(bins[:-1], bins=bins, weights=counts)
plt.show()
What worked for me is
df.groupBy("C1").count().rdd.values().histogram(10)
I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the pyspark.sql (DataFrame) API. Note that histogram requires an argument: either a number of buckets (10 here) or a list of bucket boundaries.
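For reference, RDD.histogram returns the bucket boundaries and the per-bucket counts as a pair of lists, so you can unpack them directly (the bucket count of 10 is arbitrary):
# 11 bucket edges and 10 per-bucket counts
boundaries, counts = df.groupBy("C1").count().rdd.values().histogram(10)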
You can use the histogram_numeric Hive UDAF:
import random
from pyspark.sql import HiveContext  # Spark 1.x API; requires Hive support

random.seed(323)
sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3) |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
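histogram_numeric returns an array of (x, y) structs, where x is a bin centroid and y its (possibly fractional) count. A minimal sketch of pulling those into Python lists for plotting, assuming the struct fields are named x and y as in the Hive UDAF:
# Collect the single aggregated row; each element is a struct
# with fields x (centroid) and y (count)
buckets = hists.collect()[0][0]
centers = [b.x for b in buckets]
heights = [b.y for b in buckets]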
You can also extract the column of interest and use the histogram method on the underlying RDD:
df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
## 0.33410233677189705,
## 0.6661765640094703,
## 0.9982507912470436],
## [327, 326, 347])
Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:
df.withColumn("bins", floor(df.C1 / 100)).groupBy("bins").count()
(floor comes from pyspark.sql.functions; without it, every distinct quotient would become its own group instead of one of 10 integer bin ids.) If your binning is more complex you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or through some other method); see the sketch below.
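A minimal sketch of the UDF approach, assuming C1 holds values in the 1-1000 range; the bucket labels and cut-offs here are made up for illustration:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical custom binning: map each value to a named bucket
def label_bin(v):
    if v < 100:
        return "low"
    elif v < 500:
        return "mid"
    return "high"

bin_udf = udf(label_bin, StringType())
df.withColumn("bins", bin_udf(df.C1)).groupBy("bins").count().show()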
If you want to plot the histogram, you can use the pyspark_dist_explore package:
import matplotlib.pyplot as plt
from pyspark_dist_explore import hist

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))
If you would like the data in a pandas DataFrame you could use:
from pyspark_dist_explore import pandas_histogram

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
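From there, a quick bar chart is one line of pandas (a sketch, assuming matplotlib is available):
pandas_df.plot.bar()
plt.show()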