Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Benford's Law in Java - how to make a math function into Java

I have a quick question. I am trying to make a fraud detection app in java, the app will be primarily based on Benford's law. Benford's law is super cool, it basically can be interpreted to say that in a real financial transaction the first digit is commonly a 1, 2, or 3 and very rarely an 8, 9. I haven't been able to get the Benford formula translated into code that can be run in Java.

http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.

I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.

Thanks so much!!

like image 251
ohGosh Avatar asked Oct 18 '11 23:10

ohGosh


3 Answers

@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.

What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are for used for comparing data to a known probability distribution to determine whether the dataset is likely/unlikely to have that probability distribution.

Chi-squared is for discrete distributions, and K-S is for continuous.


For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N=1,2,... 9, iterate over the dataset to check the first digit to count # of samples for each of the 9 non-zero digits (or first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].

For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:

E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] =  9.6910
E[5] =  7.9181
E[6] =  6.6946
E[7] =  5.7992
E[8] =  5.1152
E[9] =  4.5757

Then compute Χ2 = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s=0 and p = s+1 = 1, and the # of bins n is 9, so the # of degrees of freedom = n-p = 8*. Then you go to your handy-dandy chi-squared table and see if the numbers look ok. For 8 degrees of freedom the confidence levels look like this:

Χ2 > 13.362: 10% chance the dataset still matches Benford's Law

Χ2 > 15.507: 5% chance the dataset still matches Benford's Law

Χ2 > 17.535: 2.5% chance the dataset still matches Benford's Law

Χ2 > 20.090: 1% chance the dataset still matches Benford's Law

Χ2 > 26.125: 0.1% chance the dataset still matches Benford's Law

Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ2 = 0.5585. That's very close to the expected distribution. (maybe even too close!)

Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ2 = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.

Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.

Looks like Apache Commons Math has a Java implementation of a chi-squared test:

ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)


*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.


Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:

  • You compute the cumulative distribution function (CDF) for your probability distribution.
  • You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
  • You find D = (approximately) the maximum vertical distance between the two curves.

enter image description here

Let's handle these more in depth for Benford's Law.

  1. CDF for Benford's Law: this is just C = log10 x, where x is in the interval [1,10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law, and instead of writing it log(1+1/n), writing it as log(n+1)-log(n) -- in other words, to get the probability of each bin, they're subtracting successive differences of log(n), so log(n) must be the CDF

  2. ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that seems to not lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the number of datapoints <= x for any valid x.

  3. Calculate maximum difference D = max(d[k]) for each d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k]).

Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:

D = max(
  log10(1.99) - 0/5, 1/5 - log10(1.99),
  log10(2.83) - 1/5, 2/5 - log10(2.83),
  log10(3.02) - 2/5, 3/5 - log10(3.02),
  log10(4.70) - 3/5, 4/5 - log10(4.70),
  log10(6.10) - 4/5, 5/5 - log10(6.10)
)

which = 0.2988 (=log10(1.99) - 0).

Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.

like image 123
Jason S Avatar answered Sep 20 '22 04:09

Jason S


If I understand correctly, you want the Benford formula in Java syntax?

public static double probability(int i) {   
    return Math.log(1+(1/(double) i))/Math.log(10);
}

Remember to insert a

import java.lang.Math;

after your package declaration.

I find it suspicious no one answered this yet.... >_>

like image 40
Rui Vieira Avatar answered Sep 21 '22 04:09

Rui Vieira


I think what you are looking for is something like this:

for(int i = (int)Math.pow(10, position-1); i <= (Math.pow(10, position)-1); i++)
        {
           answer +=  Math.log(1+(1/(i*10+(double) digit)));
        }

answer *= (1/Math.log(10)));
like image 24
Chester Moistmuffins Avatar answered Sep 21 '22 04:09

Chester Moistmuffins