Benford's Law in Java - how to make a math function into Java

Question

I have a quick question. I am trying to make a fraud detection app in java, the app will be primarily based on Benford's law. Benford's law is super cool, it basically can be interpreted to say that in a real financial transaction the first digit is commonly a 1, 2, or 3 and very rarely an 8, 9. I haven't been able to get the Benford formula translated into code that can be run in Java.

http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.

I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.

Thanks so much!!

Jason S · Accepted Answer

@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.

What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are for used for comparing data to a known probability distribution to determine whether the dataset is likely/unlikely to have that probability distribution.

Chi-squared is for discrete distributions, and K-S is for continuous.

For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N=1,2,... 9, iterate over the dataset to check the first digit to count # of samples for each of the 9 non-zero digits (or first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].

For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:

E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] =  9.6910
E[5] =  7.9181
E[6] =  6.6946
E[7] =  5.7992
E[8] =  5.1152
E[9] =  4.5757

Then compute Χ² = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s=0 and p = s+1 = 1, and the # of bins n is 9, so the # of degrees of freedom = n-p = 8*. Then you go to your handy-dandy chi-squared table and see if the numbers look ok. For 8 degrees of freedom the confidence levels look like this:

Χ² > 13.362: 10% chance the dataset still matches Benford's Law

Χ² > 15.507: 5% chance the dataset still matches Benford's Law

Χ² > 17.535: 2.5% chance the dataset still matches Benford's Law

Χ² > 20.090: 1% chance the dataset still matches Benford's Law

Χ² > 26.125: 0.1% chance the dataset still matches Benford's Law

Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ² = 0.5585. That's very close to the expected distribution. (maybe even too close!)

Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ² = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.

Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.

Looks like Apache Commons Math has a Java implementation of a chi-squared test:

ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)

*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.

Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:

You compute the cumulative distribution function (CDF) for your probability distribution.
You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
You find D = (approximately) the maximum vertical distance between the two curves.

enter image description here

Let's handle these more in depth for Benford's Law.

CDF for Benford's Law: this is just C = log₁₀ x, where x is in the interval [1,10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law, and instead of writing it log(1+1/n), writing it as log(n+1)-log(n) -- in other words, to get the probability of each bin, they're subtracting successive differences of log(n), so log(n) must be the CDF
ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that seems to not lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the number of datapoints <= x for any valid x.
Calculate maximum difference D = max(d[k]) for each d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k]).

Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:

D = max(
  log10(1.99) - 0/5, 1/5 - log10(1.99),
  log10(2.83) - 1/5, 2/5 - log10(2.83),
  log10(3.02) - 2/5, 3/5 - log10(3.02),
  log10(4.70) - 3/5, 4/5 - log10(4.70),
  log10(6.10) - 4/5, 5/5 - log10(6.10)
)

which = 0.2988 (=log10(1.99) - 0).

Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.

Rui Vieira · Answer

If I understand correctly, you want the Benford formula in Java syntax?

public static double probability(int i) {   
    return Math.log(1+(1/(double) i))/Math.log(10);
}

Remember to insert a

import java.lang.Math;

after your package declaration.

I find it suspicious no one answered this yet.... >_>

Chester Moistmuffins · Answer

I think what you are looking for is something like this:

for(int i = (int)Math.pow(10, position-1); i <= (Math.pow(10, position)-1); i++)
        {
           answer +=  Math.log(1+(1/(i*10+(double) digit)));
        }

answer *= (1/Math.log(10)));

Benford's Law in Java - how to make a math function into Java

Tags:

java

syntax

function

math

benfords-law

ohGosh

3 Answers

Jason S

Rui Vieira

Chester Moistmuffins

Recent Activity

Donate For Us

Benford's Law in Java - how to make a math function into Java

Tags:

java

syntax

function

math

benfords-law

ohGosh

3 Answers

Jason S

Rui Vieira

Chester Moistmuffins

Related questions

Recent Activity

Donate For Us