I have a 2D histogram h1 with var1 on the x axis and var2 on the y axis, which I've plotted from a dataframe. I have normalised it as I want in c++ but now need to do the same in python and am struggling with how to get and set bin content.
The idea is to remove the effect of having more events in one part of the distribution than in another and only leave the correlation between var1 and var2.
Working Code in c++:
double norm = h1->GetEntries()/h1->GetNbinsX();
int nbins = h1->GetNbinsX();
for(int i = 1; i< nbins+1; i++)
{
double nevents = 0.;
for(int iy = 1; iy< h1->GetNbinsY()+1; iy++)
{
float bincont = h1->GetBinContent(i,iy);
nevents+=bincont;
}
for(int iy = 1; iy< h1->GetNbinsY()+1; iy++)
{
float bincont = h1->GetBinContent(i,iy);
float fact = norm/nevents;
float value = bincont*fact;
h1->SetBinContent(i,iy,value);
}
}
Attempt for code in python:
plt.hist2d(var1, var2, bins=(11100, 1030), cmap=plt.cm.BuPu)
norm = 10
for i in var1:
nevents = 0.
for j in var2:
plt.GetBinContent(i,j)
nevents+=bincont
for j in var2:
plt.GetBinContent(i,j)
fact = norm/nevents
value = bincont*fact
plt.SetBinContent(i, j, value)
Edit after help from @JohanC:
Problem has been resolved. Make sure you don't have nan-s when normalising, because dealing with them is always a pain.
To manipulate the contents of the bins, you could first calculate them, change them and only then draw the plot.
plt.hist2d() returns the bin contents (a 2D matrix) together with the bin edges in both directions. To get the same information without plotting, np.histogram2d() returns exactly the same values. Afterwards, the result can be plotted via plt.pcolormesh().
For some reason, the returned matrix is transposed. So, the first step is to transpose it again.
To calculate sums and do multiplications and divisions on 2D arrays, numpy has some powerful array and broadcasting operations. The double loops in C++ are just one operation in numpy: hist *= norm / hist.sum(axis=0, keepdims=True). As the denominator can be zero, the warning can be suppressed (the result will be NaNs and Infs that are ignored for plotting).
Here is some demo code. Note that using bins=(11100, 1030) is extremely large. The code below uses much smaller values.
from matplotlib import pyplot as plt
import numpy as np
N = 1000000
var1 = np.concatenate([np.random.uniform(0, 20, size=9 * N // 10), np.random.normal(10, 1, size=N // 10)])
var2 = var1 * 0.1 + np.random.normal(size=N)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))
norm = 10
binsX = 200
binsY = 100
ax1.hist2d(var1, var2, bins=(binsX, binsY), cmap='BuPu')
ax1.set_title('regular 2d histogram')
hist, xedges, yedges = np.histogram2d(var1, var2, bins=(binsX, binsY))
hist = hist.T
with np.errstate(divide='ignore', invalid='ignore'): # suppress division by zero warnings
hist *= norm / hist.sum(axis=0, keepdims=True)
ax2.pcolormesh(xedges, yedges, hist, cmap='BuPu')
ax2.set_title('normalized columns')
plt.show()

PS: About hist *= norm / hist.sum(axis=0, keepdims=True):
hist.sum(axis=0, keepdims=True) creates a new matrix (name it s) where for each h[i, j] the elements are replaced by the sum over all i, so s[i, j] = sum([h[k,j] for k in range(0, N)]). Without keepdims=True, a 1D array would be created with only the sums.hist *= norm / s creates a loop over all i,j as in h[i,j]=h[i,j]*norm/s[i,j]. Division by zero creates NaN when dividing zero by zero, and inf when dividing another number by zero. These values are ignored by pcolormesh.Optionally you could execute nan_to_num():
hist = np.nan_to_num(hist, nan=0, posinf=0, neginf=0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With