Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python equivalent fo hive NumericHistogram

Tags:

python

pypy

hive

I have a python(to be more specific -- pypy compatible) code. I have to use histograms in it, which should be very similar if not exactly same as this java histogram implementation in Hive project. Numpy has a histogram implementation. Does anyone know if these two are equivalent in use or can be made by choosing appropriate parameter values? I can find this by reading the codes, though checking here if someone already knows this.

like image 486
PHcoDer Avatar asked Mar 07 '26 02:03

PHcoDer


1 Answers

Long answer, TLDR in bold

You can just compile and run both instead of trying to theory craft and guess at what the code does. When in doubt, just run it. With a little bit of work, you can get NumericHistogram.java that you linked to compile without maven and just using a CLI javac call (just delete the references to the hive packages and associated methods).

I simply tested both on the array [0,1,...,98,99].

Java version (java 10.0.2):

EDIT: Received feedback (by email) to include the Java code. Here it is (removed docstrings and some comments, and didn't include all the public methods):

/*                                                                                                                                                                                                          
 * Licensed to the Apache Software Foundation (ASF) under one                                                                                                                                               
 * or more contributor license agreements.  See the NOTICE file                                                                                                                                             
 * distributed with this work for additional information                                                                                                                                                    
 * regarding copyright ownership.  The ASF licenses this file                                                                                                                                               
 * to you under the Apache License, Version 2.0 (the                                                                                                                                                        
 * "License"); you may not use this file except in compliance                                                                                                                                               
 * with the License.  You may obtain a copy of the License at                                                                                                                                               
 *                                                                                                                                                                                                          
 *     http://www.apache.org/licenses/LICENSE-2.0                                                                                                                                                           
 *                                                                                                                                                                                                          
 * Unless required by applicable law or agreed to in writing, software                                                                                                                                      
 * distributed under the License is distributed on an "AS IS" BASIS,                                                                                                                                        
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.                                                                                                                                 
 * See the License for the specific language governing permissions and                                                                                                                                      
 * limitations under the License.                                                                                                                                                                           
 */

import java.util.ArrayList;
import java.util.List;
import java.util.Collections;
import java.util.Random;

public class NumericHistogram {
    static class Coord implements Comparable {
        double x;
        double y;

        public int compareTo(Object other) {
            return Double.compare(x, ((Coord) other).x);
        }
    };

    // Class variables                                                                                                                                                                                      
    private int nbins;
    private int nusedbins;
    private ArrayList<Coord> bins;
    private Random prng;

    public NumericHistogram() {
        nbins = 0;
        nusedbins = 0;
        bins = null;

        // init the RNG for breaking ties in histogram merging. A fixed seed is specified here                                                                                                              
        // to aid testing, but can be eliminated to use a time-based seed (which would                                                                                                                      
        // make the algorithm non-deterministic).                                                                                                                                                           
        prng = new Random(31183);
    }

    public void reset() {
        bins = null;
        nbins = nusedbins = 0;
    }

    public int getUsedBins() {
        return nusedbins;
    }

    public boolean isReady() {
        return nbins != 0;
    }

    public Coord getBin(int b) {
        return bins.get(b);
    }

    public void allocate(int num_bins) {
        nbins = num_bins;
        bins = new ArrayList<Coord>();
        nusedbins = 0;
    }

    public void add(double v) {
        int bin = 0;
        for(int l=0, r=nusedbins; l < r; ) {
            bin = (l+r)/2;
            if (bins.get(bin).x > v) {
                r = bin;
            } else {
                if (bins.get(bin).x < v) {
                    l = ++bin;
                } else {
                    break; // break loop on equal comparator                                                                                                                                                
                }
            }
        }
        if (bin < nusedbins && bins.get(bin).x == v) {
            bins.get(bin).y++;
        } else {
            Coord newBin = new Coord();
            newBin.x = v;
            newBin.y = 1;
            bins.add(bin, newBin);

            // Trim the bins down to the correct number of bins.                                                                                                                                            
            if (++nusedbins > nbins) {
                trim();
            }
        }

    }

    private void trim() {
        while(nusedbins > nbins) {
            // Find the closest pair of bins in terms of x coordinates. Break ties randomly.                                                                                                                
            double smallestdiff = bins.get(1).x - bins.get(0).x;
            int smallestdiffloc = 0, smallestdiffcount = 1;
            for(int i = 1; i < nusedbins-1; i++) {
                double diff = bins.get(i+1).x - bins.get(i).x;
                if(diff < smallestdiff)  {
                    smallestdiff = diff;
                    smallestdiffloc = i;
                    smallestdiffcount = 1;
                } else {
                    if(diff == smallestdiff && prng.nextDouble() <= (1.0/++smallestdiffcount) ) {
                        smallestdiffloc = i;
                    }
                }
            }

            double d = bins.get(smallestdiffloc).y + bins.get(smallestdiffloc+1).y;
            Coord smallestdiffbin = bins.get(smallestdiffloc);
            smallestdiffbin.x *= smallestdiffbin.y / d;
            smallestdiffbin.x += bins.get(smallestdiffloc+1).x / d * bins.get(smallestdiffloc+1).y;
            smallestdiffbin.y = d;
            // Shift the remaining bins left one position                                                                                                                                                   
            bins.remove(smallestdiffloc+1);
            nusedbins--;
        }
    }

    public int getNumBins() {
        return bins == null ? 0 : bins.size();
    }
}

I inserted this main into the NumericHistogram class (see above):

public static void main(String[] args) {
    NumericHistogram hist = new NumericHistogram();
    hist.allocate(10);
    for (int i = 0; i < 100; i++) {
        hist.add(i);
    }

    ArrayList<Coord> bins = new ArrayList<Coord>();
    for (int i = 0; i < 10; i++) {
        bins.add(hist.getBin(i));
    }

    int index = 0;
    for (Coord bin : bins) {
        System.out.println(index + "th bin x: " + bin.x);
        System.out.println(index + "th bin y: " + bin.y);
        index++;
    }
}

This output:

Matthews-MacBook-Pro:stackoverflow matt$ java NumericHistogram
0th bin x: 2.0
0th bin y: 5.0
1th bin x: 9.5
1th bin y: 10.0
2th bin x: 21.5
2th bin y: 14.0
3th bin x: 33.0
3th bin y: 9.0
4th bin x: 44.0
4th bin y: 13.0
5th bin x: 55.0
5th bin y: 9.0
6th bin x: 64.5
6th bin y: 10.0
7th bin x: 75.5
7th bin y: 12.0
8th bin x: 88.0
8th bin y: 13.0
9th bin x: 97.0
9th bin y: 5.0

Python version:

The numpy version was a bit different. Here's the script:

import numpy as np

hist = np.histogram(np.arange(100)) # 10 is the default for num bins
print(hist)

This output:

(array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10]), array([ 0. ,  9.9, 19.8, 29.7, 39.6, 49.5, 59.4, 69.3, 79.2, 89.1, 99. ]))

To compare to the java version, we'd really have to zip them, which you can do like this:

y_values, x_values = hist
i = 0
for x_val, y_val in zip(x_values, y_values):
    print(str(i) + "th bin x: " + str(x_val))
    print(str(i) + "th bin y: " + str(y_val))
    i += 1

and it output:

0th bin x: 0.0
0th bin y: 10
1th bin x: 9.9
1th bin y: 10
2th bin x: 19.8
2th bin y: 10
3th bin x: 29.700000000000003
3th bin y: 10
4th bin x: 39.6
4th bin y: 10
5th bin x: 49.5
5th bin y: 10
6th bin x: 59.400000000000006
6th bin y: 10
7th bin x: 69.3
7th bin y: 10
8th bin x: 79.2
8th bin y: 10
9th bin x: 89.10000000000001
9th bin y: 10

Summary and making them match:

Are they equivalent? No, they are not equivalent. They're using different partitioning schemes to assign values to bins. Java version puts the item in the "closest" bin using rounding rules, whereas numpy is doing a range grouping (i.e. 0-9.9, 9.9-19.8, ..., 89.1-100). Furthermore, the default bins they generate are also not the same, which kind of makes sense since the schemes are so different.

Can I get them to match? Yes, but you'll have to figure out how to generate random numbers exactly like Java in the Python realm if you want a general implementation.. I tried copying the implementation, and I got it to work, but I'm unable to get the random numbers to be the same. Even if you're using numpy instead of copying the logic to Python, you'll still have to be able to generate Java's random numbers in order to get them to match in the general case. This project seemed relevant, but it's only Java 6 RNG, not later versions of Java: https://github.com/MostAwesomeDude/java-random. At risk of making this giant answer even longer, here's the Python code to copy the Java implementation:

""" Just a mindless copy for easy verification. Not good style or performant. """
import random
class Coord:
    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

    def __repr__(self): # debug
        return "Coord(" + str(self.x) + ", " + str(self.y) + ")"

    def __str__(self): # debug
        return "Coord(" + str(self.x) + ", " + str(self.y) + ")"

class Random:
    """ This class needs fixin. You'll have to do some work here to make it match your version of Java. """
    def __init__(self, seed):
        random.seed(seed)

    def nextDouble(self):
        return random.uniform(0, 1)

class NumericHistogram:
    def __init__(self):
        self.nbins = 0
        self.nusedbins = 0
        self.bins = None

        self.prng = Random(31183) # This should behave the same as Java's RNG for your Java version.                                                                                                        

    def allocate(self, num_bins):
        self.nbins = num_bins
        self.bins = []
        self.nusedbins = 0

    def add(self, v):
        bin = 0

        l = 0
        r = self.nusedbins
        while(l < r):
            bin = (l+r)//2
            if self.bins[bin].x > v:
                r = bin
            else:
                if self.bins[bin].x < v:
                    l = bin + 1; bin += 1
                else:
                    break

        if bin < self.nusedbins and self.bins[bin].x == v:
            self.bins[bin].y += 1
        else:
            newBin = Coord(x=v, y=1)
            if bin == len(self.bins):
                self.bins.append(newBin)
            else:
                self.bins[bin] == newBin

            self.nusedbins += 1
            if (self.nusedbins > self.nbins):
                self.trim()
    def trim(self):
        while self.nusedbins > self.nbins:
            smallestdiff = self.bins[1].x - self.bins[0].x
            smallestdiffloc = 0
            smallestdiffcount = 1
            for i in range(1, self.nusedbins-1):
                diff = self.bins[i+1].x - self.bins[i].x
                if diff < smallestdiff:
                    smallestdiff = diff
                    smallestdiffloc = i
                    smallestdiffcount = 1
                else:
                    smallestdiffcount += 1
                    if diff == smallestdiff and self.prng.nextDouble() <= (1.0/smallestdiffcount):
                        smallestdiffloc = i

            d = self.bins[smallestdiffloc].y + self.bins[smallestdiffloc+1].y
            smallestdiffbin = self.bins[smallestdiffloc]
            smallestdiffbin.x *= smallestdiffbin.y / d
            smallestdiffbin.x += self.bins[smallestdiffloc+1].x / d * self.bins[smallestdiffloc+1].y
            smallestdiffbin.y = d
            self.bins.pop(smallestdiffloc+1)
            self.nusedbins -= 1

And the "main":

hist = NumericHistogram()
hist.allocate(10)
for i in range(100):
    hist.add(i)

for ind, bin in enumerate(hist.bins):
    print(str(ind) + "th bin x: " + str(bin.x))
    print(str(ind) + "th bin y: " + str(bin.y))

this outputs:

0th bin x: 4.0
0th bin y: 9.0
1th bin x: 13.0
1th bin y: 9.0
2th bin x: 24.0
2th bin y: 13.0
3th bin x: 35.0
3th bin y: 9.0
4th bin x: 46.5
4th bin y: 14.0
5th bin x: 57.5
5th bin y: 8.0
6th bin x: 66.0
6th bin y: 9.0
7th bin x: 76.49999999999999
7th bin y: 12.0
8th bin x: 88.99999999999999
8th bin y: 13.0
9th bin x: 97.5
9th bin y: 4.0

So it's kind of close, but no bananas. The difference is down to the RNG (as far as I can tell). I don't know a lot about Java Random or RNGs in general: so it might be best to just post another question here on SO about how to generate random numbers exactly like Java [insert-your-java-version-here].

HTH and can get you started in the right direction.

like image 74
Matt Messersmith Avatar answered Mar 08 '26 16:03

Matt Messersmith



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!