 

How does matplotlib's histogramdd work?

I find the output of histogramdd confusing. For example:

from numpy import histogramdd
h, edges = histogramdd([[1,2,1],[4,2,1]],bins=2)

h -> [[ 1.  1.]
     [ 1.  0.]]
edges -> [array([ 1. ,  1.5,  2. ]), array([ 1. ,  2.5,  4. ])]

Maybe I don't understand the documentation, but it seems to suggest that the input should be an array with N rows representing data points and D columns representing dimensions, so in this case we would be dealing with two data points in three dimensions. I guess that each array in edges represents a different dimension, but that doesn't seem to make sense given the output h.

How is this supposed to be interpreted?

Thanks

Robert Smith asked Nov 03 '12 01:11



1 Answer

UPDATE

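I was wrong the last time; this is the correct interpretation of histogramdd. First of all, it matters whether you pass an array or a plain list, because histogramdd reads the two differently. An array of shape (N, D) is read as N points in D dimensions, whereas a list is read as a sequence of D coordinate lists, one per dimension. Passing a list of points therefore gives an unexpected result: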

Compare this:

In [59]: h, edges = histogramdd([[1,2,4],[4,2,8],[3,2,1],[2,1,2],[2,1,3],[2,1,1],[2,1,4]],bins=3)
h.shape
Out[59]: (3, 3, 3, 3, 3, 3, 3)

to this:

In [60]: h, edges = histogramdd(array([[1,2,4],[4,2,8],[3,2,1],[2,1,2],[2,1,3],[2,1,1],[2,1,4]]),bins=3)
h.shape
Out[60]: (3, 3, 3)
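
The difference in shape comes from how the input is read, and it also explains the output in the question. Below is a minimal sketch, assuming a current NumPy with histogramdd called as np.histogramdd (the sessions here use bare pylab-style names). The list from the question is read as one list of values per dimension, so it describes the three 2-D points (1, 4), (2, 2) and (1, 1). Transposing it into an (N, D) array gives the same histogram.

```python
import numpy as np

# The list from the question: two inner lists of three values each.
sample_list = [[1, 2, 1], [4, 2, 1]]

# A list is read as D coordinate sequences: x = [1, 2, 1], y = [4, 2, 1],
# i.e. the three 2-D points (1, 4), (2, 2) and (1, 1).
h_list, edges_list = np.histogramdd(sample_list, bins=2)

# The same three points written explicitly as an (N, D) = (3, 2) array.
points = np.array(sample_list).T
h_points, edges_points = np.histogramdd(points, bins=2)

# Passing the untransposed array instead gives two points in three dimensions.
h_array, _ = np.histogramdd(np.array(sample_list), bins=2)

print(h_list.shape, h_points.shape, h_array.shape)
print(np.array_equal(h_list, h_points))
```

So the 2x2 result in the question is a histogram of three points in two dimensions, not of two points in three dimensions.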

Using the second approach, we obtain sensible results:

In [61]: h, edges = histogramdd(array([[1,2,4],[4,2,8],[3,2,1],[2,1,2],[2,1,3],[2,1,1],[2,1,4]]),bins=3)
In [64]: h
Out[64]:
array([[[ 0.,  0.,  0.],
        [ 0.,  0.,  0.],
        [ 0.,  1.,  0.]],

       [[ 3.,  1.,  0.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]],

       [[ 0.,  0.,  0.],
        [ 0.,  0.,  0.],
        [ 1.,  0.,  1.]]])
In [65]: edges
Out[65]:
[array([ 1.,  2.,  3.,  4.]),
 array([ 1.        ,  1.33333333,  1.66666667,  2.        ]),
 array([ 1.        ,  3.33333333,  5.66666667,  8.        ])]
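
With an integer bins and no range argument, the edges of each dimension are evenly spaced between that dimension's minimum and maximum. The following sketch reproduces the edges above with np.linspace, assuming the same sample array; it is a check on the output, not a copy of NumPy's internals.

```python
import numpy as np

points = np.array([[1, 2, 4], [4, 2, 8], [3, 2, 1], [2, 1, 2],
                   [2, 1, 3], [2, 1, 1], [2, 1, 4]])
bins = 3

h, edges = np.histogramdd(points, bins=bins)

# For each column (dimension), bins + 1 evenly spaced edges from min to max.
manual_edges = [np.linspace(col.min(), col.max(), bins + 1) for col in points.T]

for dim, (e, m) in enumerate(zip(edges, manual_edges)):
    print(dim, e, np.allclose(e, m))
```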

Our input is the points [1,2,4], [4,2,8], etc., and edges holds the bin edges for each dimension. Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. In this example, [1,2,4] is counted as follows. The 1 belongs to the first bin of array([1.,2.,3.,4.]) because it lies between 1 and 2. The 2 belongs to the third bin of array([ 1. , 1.33333333, 1.66666667, 2. ]) because it lies between 1.66666667 and 2 (it sits on the rightmost edge, which the last bin includes). The 4 belongs to the second bin of array([ 1. , 3.33333333, 5.66666667, 8. ]) because it lies between 3.33333333 and 5.66666667. So the point [1,2,4] falls in the first, third and second bins for x, y and z respectively. This means that we count it in the first array, third row, second column, that is, h[0, 2, 1]:

[[ 0.,  0.,  0.],
 [ 0.,  0.,  0.],
 [ 0.,  1.*, 0.]]
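
The bin lookup for a single point can be sketched with np.searchsorted. This is an illustration rather than NumPy's internal code, and the helper name bin_index is made up for this example. It assumes the usual convention that a value equal to the rightmost edge is counted in the last bin.

```python
import numpy as np

def bin_index(value, edge_array):
    """Return the 0-based bin of value for the given bin edges (hypothetical helper)."""
    # side="right" counts the edges <= value; subtracting 1 gives the bin.
    idx = np.searchsorted(edge_array, value, side="right") - 1
    # A value equal to the last edge belongs to the last bin.
    if value == edge_array[-1]:
        idx -= 1
    return int(idx)

points = np.array([[1, 2, 4], [4, 2, 8], [3, 2, 1], [2, 1, 2],
                   [2, 1, 3], [2, 1, 1], [2, 1, 4]])
h, edges = np.histogramdd(points, bins=3)

point = [1, 2, 4]
index = tuple(bin_index(v, e) for v, e in zip(point, edges))
print(index)     # indices are 0-based: "first, third, second" is (0, 2, 1)
print(h[index])  # the count stored in that cell
```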

I added a * to let you identify it more easily. The second point, [4,2,8], is located in the third bin for each of x, y and z (third array, third row, third column, that is, h[2, 2, 2]):

[[ 0.,  0.,  0.],
 [ 0.,  0.,  0.],
 [ 1.,  0.,  1.*]]

As a final example, the third point, [3,2,1], is located in the third bin, third bin and first bin for x, y, z respectively (third array, third row, first column, that is, h[2, 2, 0]):

[[ 0.,  0.,  0.],
 [ 0.,  0.,  0.],
 [ 1.*,  0.,  1.]]
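
To check the whole histogram rather than individual points, the same lookup can be applied to every row and the counts accumulated by hand. This is a sketch under the same assumptions as before (np.histogramdd from a current NumPy, values on the rightmost edge counted in the last bin); the variable names are only for illustration.

```python
import numpy as np

points = np.array([[1, 2, 4], [4, 2, 8], [3, 2, 1], [2, 1, 2],
                   [2, 1, 3], [2, 1, 1], [2, 1, 4]])
h, edges = np.histogramdd(points, bins=3)

# One column of 0-based bin indices per dimension.
indices = np.empty(points.shape, dtype=int)
for dim, edge_array in enumerate(edges):
    idx = np.searchsorted(edge_array, points[:, dim], side="right") - 1
    # Values on the rightmost edge are counted in the last bin.
    idx[points[:, dim] == edge_array[-1]] -= 1
    indices[:, dim] = idx

# Add one count per point at its (x, y, z) bin; np.add.at handles repeats.
manual = np.zeros(h.shape)
np.add.at(manual, tuple(indices.T), 1)

for point, index in zip(points.tolist(), indices.tolist()):
    print(point, "->", tuple(index))
print(np.array_equal(manual, h))
```

The four remaining points all have x = 2 and y = 1, so they land in the second array, first row, which is where the 3 and the 1 in h come from.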
Robert Smith answered Oct 05 '22 23:10
