I have a quick question. I am trying to make a fraud detection app in java, the app will be primarily based on Benford's law. Benford's law is super cool, it basically can be interpreted to say that in a real financial transaction the first digit is commonly a 1, 2, or 3 and very rarely an 8, 9. I haven't been able to get the Benford formula translated into code that can be run in Java.
http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.
I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.
Thanks so much!!
@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.
What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are for used for comparing data to a known probability distribution to determine whether the dataset is likely/unlikely to have that probability distribution.
Chi-squared is for discrete distributions, and K-S is for continuous.
For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N=1,2,... 9, iterate over the dataset to check the first digit to count # of samples for each of the 9 non-zero digits (or first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].
For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:
E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] = 9.6910
E[5] = 7.9181
E[6] = 6.6946
E[7] = 5.7992
E[8] = 5.1152
E[9] = 4.5757
Then compute Χ2 = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s=0 and p = s+1 = 1, and the # of bins n is 9, so the # of degrees of freedom = n-p = 8*. Then you go to your handy-dandy chi-squared table and see if the numbers look ok. For 8 degrees of freedom the confidence levels look like this:
Χ2 > 13.362: 10% chance the dataset still matches Benford's Law
Χ2 > 15.507: 5% chance the dataset still matches Benford's Law
Χ2 > 17.535: 2.5% chance the dataset still matches Benford's Law
Χ2 > 20.090: 1% chance the dataset still matches Benford's Law
Χ2 > 26.125: 0.1% chance the dataset still matches Benford's Law
Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ2 = 0.5585. That's very close to the expected distribution. (maybe even too close!)
Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ2 = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.
Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.
Looks like Apache Commons Math has a Java implementation of a chi-squared test:
ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)
*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.
Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:
Let's handle these more in depth for Benford's Law.
CDF for Benford's Law: this is just C = log10 x, where x is in the interval [1,10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law, and instead of writing it log(1+1/n), writing it as log(n+1)-log(n) -- in other words, to get the probability of each bin, they're subtracting successive differences of log(n), so log(n) must be the CDF
ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that seems to not lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the number of datapoints <= x for any valid x.
Calculate maximum difference D = max(d[k]) for each d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k]).
Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:
D = max(
log10(1.99) - 0/5, 1/5 - log10(1.99),
log10(2.83) - 1/5, 2/5 - log10(2.83),
log10(3.02) - 2/5, 3/5 - log10(3.02),
log10(4.70) - 3/5, 4/5 - log10(4.70),
log10(6.10) - 4/5, 5/5 - log10(6.10)
)
which = 0.2988 (=log10(1.99) - 0).
Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.
If I understand correctly, you want the Benford formula in Java syntax?
public static double probability(int i) {
return Math.log(1+(1/(double) i))/Math.log(10);
}
Remember to insert a
import java.lang.Math;
after your package declaration.
I find it suspicious no one answered this yet.... >_>
I think what you are looking for is something like this:
for(int i = (int)Math.pow(10, position-1); i <= (Math.pow(10, position)-1); i++)
{
answer += Math.log(1+(1/(i*10+(double) digit)));
}
answer *= (1/Math.log(10)));
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With