 

How to 'checksum' an array of noisy floating point numbers?

What is a quick and easy way to 'checksum' an array of floating point numbers, while allowing for a specified small amount of inaccuracy?

e.g. I have two algorithms which should (in theory, with infinite precision) output the same array. But they work differently, and so floating point errors will accumulate differently, though the array lengths should be exactly the same. I'd like a quick and easy way to test if the arrays seem to be the same. I could of course compare the numbers pairwise, and report the maximum error; but one algorithm is in C++ and the other is in Mathematica and I don't want the bother of writing out the numbers to a file or pasting them from one system to another. That's why I want a simple checksum.

I could simply add up all the numbers in the array. If the array length is N, and I can tolerate an error of 0.0001 in each number, then I would check if abs(sum1-sum2)<0.0001*N. But this simplistic 'checksum' is not robust, e.g. to an error of +10 in one entry and -10 in another. (And anyway, probability theory says that the error probably grows like sqrt(N), not like N.) Of course, any checksum is a low-dimensional summary of a chunk of data so it will miss some errors, if not most... but simple checksums are nonetheless useful for finding non-malicious bug-type errors.

Or I could create a two-dimensional checksum, [sum(x[n]), sum(abs(x[n]))]. But is that the best I can do, i.e. is there a different function I might use that would be "more orthogonal" to sum(x[n])? And if I used some arbitrary functions, e.g. [sum(f1(x[n])), sum(f2(x[n]))], then how should my 'raw error tolerance' translate into 'checksum error tolerance'?

I'm programming in C++, but I'm happy to see answers in any language.
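
For concreteness, a minimal sketch (in C++) of the two checksums just described; the eps = 0.0001 tolerance and the sqrt(N) scaling come straight from the numbers above and are illustrative rather than a derived bound:

#include <cmath>
#include <cstddef>
#include <vector>

// Two-component fuzzy checksum: [sum(x), sum(|x|)].
struct FuzzyChecksum {
    double sum = 0.0;
    double sum_abs = 0.0;
};

FuzzyChecksum fuzzy_checksum(const std::vector<double>& x) {
    FuzzyChecksum c;
    for (double v : x) {
        c.sum += v;
        c.sum_abs += std::fabs(v);
    }
    return c;
}

// Compare two checksums; eps is the tolerated per-element error, and the
// sqrt(N) factor reflects the expectation that independent errors in the
// sum accumulate roughly like sqrt(N) rather than N.
bool roughly_equal(const FuzzyChecksum& a, const FuzzyChecksum& b,
                   std::size_t n, double eps = 1e-4) {
    const double tol = eps * std::sqrt(static_cast<double>(n));
    return std::fabs(a.sum - b.sum) <= tol &&
           std::fabs(a.sum_abs - b.sum_abs) <= tol;
}

Each program computes its own checksum, so only these two numbers (plus N) need to be carried between systems.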

asked Mar 19 '12 by DamonJW

4 Answers

i have a feeling that what you want may be possible via something like gray codes. if you could translate your values into gray codes and use some kind of checksum that was able to correct n bits you could detect whether or not the two arrays were the same except for n-1 bits of error, right? (each bit of error means a number is "off by one", where the mapping would be such that this was a variation in the least significant digit).

but the exact details are beyond me - particularly for floating point values.

i don't know if it helps, but what gray codes solve is the problem of pathological rounding. rounding sounds like it will solve the problem - a naive solution might round and then checksum. but simple rounding always has pathological cases - for example, if we use floor, then 0.9999999 and 1 are distinct. a gray code approach seems to address that, since neighbouring values are always a single bit away, so a bit-based checksum will accurately reflect "distance".

[update:] more exactly, what you want is a checksum that gives an estimate of the hamming distance between your gray-encoded sequences (and the gray encoding part is easy if you just care about 0.0001, since you can multiply everything by 10000 and use integers).

and it seems like such checksums do exist: Any error-correcting code can be used for error detection. A code with minimum Hamming distance, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.

so, just in case it's not clear:

  • multiply by the reciprocal of the minimum error (e.g. 10000 for a tolerance of 0.0001) to get integers
  • convert to the gray code equivalent
  • use an error detecting code with a minimum hamming distance larger than the number of bit errors you want to detect (a sketch of the first two steps follows below).
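
as a concrete sketch of the first two bullets (quantize, then gray-encode) - this assumes c++20 for std::popcount and borrows the 0.0001 tolerance from the question; the final error-detecting-code step is not included:

#include <bit>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a value to an integer grid of width 'tol', then convert the
// binary representation to its gray code (n ^ (n >> 1)), so that values
// landing in adjacent buckets differ in exactly one bit.
std::uint64_t gray_encode(double x, double tol = 1e-4) {
    const auto n = static_cast<std::uint64_t>(std::llround(x / tol));
    return n ^ (n >> 1);
}

// Total hamming distance between two gray-encoded arrays of equal length.
std::size_t hamming_distance(const std::vector<double>& a,
                             const std::vector<double>& b,
                             double tol = 1e-4) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += std::popcount(gray_encode(a[i], tol) ^ gray_encode(b[i], tol));
    return d;
}

with this encoding, two values that land in adjacent buckets contribute exactly one bit each, which is why the worst case discussed below is one bit per element.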

but i am still not sure that's right. you still get the pathological rounding in the conversion from float to integer. so it seems like you need a minimum hamming distance that is 1 + len(data) (worst case, with a rounding error on each value). is that feasible? probably not for large arrays.

maybe ask again with better tags/description now that a general direction is possible? or just add tags now? we need someone who does this for a living. [i added a couple of tags]

answered by andrew cooke

Try this:

#include <complex>
#include <cmath>
#include <cstddef>
#include <iostream>

// PARAMETERS: a handful of arbitrary sampling frequencies.
const size_t no_freqs = 3;
const double freqs[no_freqs] = {0.05, 0.16, 0.39}; // (for example)

int main() {
    // Accumulate the Fourier transform of the input stream, evaluated
    // only at the few frequencies above.
    std::complex<double> spectral_amplitude[no_freqs];
    for (size_t i = 0; i < no_freqs; ++i) spectral_amplitude[i] = 0.0;
    size_t n_data = 0;
    {
        std::complex<double> datum;
        while (std::cin >> datum) {
            for (size_t i = 0; i < no_freqs; ++i) {
                // Add the n-th datum's contribution exp(i*f*n) * x[n].
                spectral_amplitude[i] += datum * std::exp(
                    std::complex<double>(0.0, 1.0) * freqs[i] * double(n_data)
                );
            }
            ++n_data;
        }
    }
    // Print the real and imaginary parts of each spectral amplitude;
    // together they form the fuzzy checksum.
    std::cout << "Fuzzy checksum:\n";
    for (size_t i = 0; i < no_freqs; ++i) {
        std::cout << real(spectral_amplitude[i]) << "\n";
        std::cout << imag(spectral_amplitude[i]) << "\n";
    }
    std::cout << "\n";
    return 0;
}

It returns just a few arbitrary points of a Fourier transform of the entire data set. These make a fuzzy checksum, so to speak.

answered by thb


I've spent a while looking for a deterministic answer, and been unable to find one. If there is a good answer, it's likely to require heavy-duty mathematical skills (functional analysis).

I'm pretty sure there is no solution based on "discretize in some cunning way, then apply a discrete checksum", e.g. "discretize into strings of 0/1/?, where ? means wildcard". Any discretization will have the property that two floating-point numbers very close to each other can end up with different discrete codes, and then the discrete checksum won't tell us what we want to know.

However, a very simple randomized scheme should work fine. Generate a pseudorandom string S from the alphabet {+1,-1}, and compute csx=sum(X_i*S_i) and csy=sum(Y_i*S_i), where X and Y are my original arrays of floating point numbers. If we model the errors as independent Normal random variables with mean 0, then it's easy to compute the distribution of csx-csy. We could do this for several strings S, and then do a hypothesis test that the mean error is 0. The number of strings S needed for the test is fixed; it doesn't grow with the size of the arrays, so it satisfies my need for a "low-dimensional summary". This method also gives an estimate of the standard deviation of the error, which may be handy.
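
Here is a minimal sketch of that scheme. The fixed seed, the choice of K = 8 strings, and the 3*eps*sqrt(N) acceptance threshold are illustrative assumptions, not a worked-out hypothesis test; both programs would also need to reproduce the same sign sequence, e.g. from an agreed seed or a shared list of signs.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Project an array onto K pseudorandom +/-1 strings.  Both programs must
// regenerate the same strings S, here by sharing the seed.
std::vector<double> random_sign_checksum(const std::vector<double>& x,
                                         std::size_t k = 8,
                                         std::uint64_t seed = 12345) {
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution coin(0.5);
    std::vector<double> cs(k, 0.0);
    for (std::size_t j = 0; j < k; ++j)
        for (double v : x)
            cs[j] += coin(rng) ? v : -v;
    return cs;
}

// Accept if every projection difference is within ~3 standard deviations,
// modelling the per-element errors as independent with sd at most eps,
// so that csx - csy has sd at most eps*sqrt(N) for each string S.
bool probably_same(const std::vector<double>& csx,
                   const std::vector<double>& csy,
                   std::size_t n, double eps = 1e-4) {
    const double tol = 3.0 * eps * std::sqrt(static_cast<double>(n));
    for (std::size_t j = 0; j < csx.size(); ++j)
        if (std::fabs(csx[j] - csy[j]) > tol) return false;
    return true;
}

Each side only has to report its K projection values, which keeps the summary low-dimensional.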

answered by DamonJW


How about computing a standard integer checksum on the data obtained by zeroing the least significant digits of the data, the ones that you don't care about?
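
A minimal sketch of this idea, with rounding to the tolerance grid standing in for zeroing the unwanted digits, and FNV-1a as one arbitrary choice of integer checksum. As pointed out in the previous answer, values that happen to straddle a rounding boundary will still produce different checksums.

#include <cmath>
#include <cstdint>
#include <vector>

// Round each value to a multiple of 'tol', then fold the resulting
// integers into a 64-bit FNV-1a hash.  Arrays that agree to within the
// tolerance usually hash identically, except near grid boundaries.
std::uint64_t rounded_checksum(const std::vector<double>& x,
                               double tol = 1e-4) {
    std::uint64_t h = 0xcbf29ce484222325ull;     // FNV-1a offset basis
    for (double v : x) {
        auto q = static_cast<std::uint64_t>(std::llround(v / tol));
        for (int byte = 0; byte < 8; ++byte) {
            h ^= (q >> (8 * byte)) & 0xffu;
            h *= 0x100000001b3ull;               // FNV-1a prime
        }
    }
    return h;
}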

answered by Ian Hinder