Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I deterministically sum a vector of arbitrarily arranged floating-point numbers?

Say I have a (possibly large) vector of floating-point numbers resulting from some black box process. Is it possible to calculate a bitwise reproducible summation of these numbers?

If the black boxprocess always produces the numbers in the same order, then a bitwise reproducible summation is easy: just sum them from left to right.

But if the numbers are generated in a random order, perhaps because they are returned and collected from asynchronous processes, then it is harder: I'd have to sort them numbers.

But what if there are even more numbers, perhaps distributed across different machines, so that it's undesirable to move them?

Is there still a way to deterministically sum them?

like image 603
Richard Avatar asked May 08 '21 05:05

Richard


People also ask

How do you sum a floating point number?

Add the fractional and integer part of the two numbers separately and forward the final carry part of fractional addition to integers part. Concatenate the digits stored for integer and fractional part with a decimal '. ' to get the required sum two large floating point numbers.

Is addition a floating point operation?

Arithmetic operations on floating point numbers consist of addition, subtraction, multiplication and division. The operations are done with algorithms similar to those used on sign magnitude integers (because of the similarity of representation) — example, only add numbers of the same sign.

Why floating point numbers Cannot be stored accurately?

Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.

Why is floating point arithmetic not exact?

In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation.


Video Answer


2 Answers

Overview: A Multi-pass, Data-Mutating Method

(See here for a better answer.)

Yes, there is a way.

A method to do just this is described in the paper "Fast Reproducible Floating-Point Summation" by Demmel and Nguyen (2013). I've included both a serial and parallel implementation of it below.

Here we'll compare three algorithms:

  1. Conventional left-to-right summation: this is fast, inaccurate, and non-reproducible with respect to permutations in the input order.
  2. Kahan summation: this is slower, very accurate (essentially O(1) in the input size), and, while non-reproducible with respect to permutations in the input order, closer to reproducibility in a narrow sense.
  3. Deterministic summation: this is somewhat slower than Kahan summation, can be quite accurate, and is bitwise reproducible.

Conventional summation is inaccurate because as the sum grows the least-significant digits of the numbers we add to it get silently dropped. Kahan summation overcomes this by keeping a running "compensation" sum which holds onto the least-significant digits. The deterministic summation uses a similar strategy to maintain accuracy but, before performing the summation, "pre-rounds" the input numbers to a common base so that their sum can be computed accurately without any rounding error.

Tests

To study the algorithms, we'll run a test on 1 million numbers each drawn uniformly from the range [-1000, 1000]. To demonstrate both the range of answers of the algorithms as well as their determinism, the input will be shuffled and summed 100 times. The parallel algorithms are run using 12 threads.

The "correct" summation (as determined by using Kahan summation in long double mode and choosing the most frequently occuring value) is

310844.700699143717685046795

A result I'll assert, which the code below demonstrates rigorously, is that for each data type studied the deterministic algorithm yields the same result for each of the 100 data orderings and that these results are the same for both the serial and parallel variants of the algorithm.

The results of the various algorithms are as follows:

Serial Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00462s
Average simple summation time        = 0.00144s (3.20x faster than deterministic)
Average Kahan summation time         = 0.00290s (1.59x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 93 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Parallel Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00576s
Average simple summation time        = 0.00206s (2.79x faster than deterministic)
Average Kahan summation time         = 0.00599s (0.96x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 89 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Serial Double
=========================================================
Average deterministic summation time = 0.00600s
Average simple summation time        = 0.00171s (3.49x faster than deterministic)
Average Kahan summation time         = 0.00378s (1.58x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 4
Distinct Simple values               = 88

Parallel Double
=========================================================
Average deterministic summation time = 0.01074s
Average simple summation time        = 0.00289s (3.71x faster than deterministic)
Average Kahan summation time         = 0.00648s (1.65x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 2
Distinct Simple values               = 83

Serial Long Double
=========================================================
Average deterministic summation time = 0.01072s
Average simple summation time        = 0.00215s (4.96x faster than deterministic)
Average Kahan summation time         = 0.00448s (2.39x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)
Distinct Kahan values                = 3
Distinct Simple values               = 94

Parallel Long Double
=========================================================
Average deterministic summation time = 0.01854s
Average simple summation time        = 0.00466s (3.97x faster than deterministic)
Average Kahan summation time         = 0.00635s (2.91x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)
Distinct Kahan values                = 1
Distinct Simple values               = 82

Discussion

So, what have we learned? From the single-precision results we see that using a double-length accumulator makes the conventional summation algorithm reproducible for this data set, though that almost certainly wouldn't be the case for arbitrary datasets. Still, this can be a cheap way to get improved accuracy when it works.

When the accumulator used by conventional summation is of the same size as the numbers being added, it works very poorly: we ran 100 tests and got nearly as many distinct results from conventional summation.

Kahan summation on the other hand produces very few distinct results (thus it is "more" deterministic) and takes approximately twice as long as conventional summation.

The deterministic algorithm takes ~4x longer than simple summation and about 1.5-3x longer than Kahan summation, but gets approximately the same answers for both the double and long double types except with bitwise determinism (it always gets the same answer no matter how the input data is sorted).

However, the single-precision floating-point gets a very bad answer, unlike the single-precision conventional and Kahan summations, which turn out to be quite close to the actual answer. Why is this?

The paper we've been working from determines that if there are n numbers in the input and we use k folding rounds (the paper recommends 2, which is what we use here), then the deterministic and conventional sums will have similar error bounds provided

n^k * e^(k-1) << 1

where e is the floating-point epsilon of the data type. These epsilon values are:

Single-precision epsilon      = 0.000000119
Double-precision epsilon      = 0.00000000000000022
Long double-precision epsilon = 0.000000000000000000108

Plugging these in to the equation along with n=1M we get:

Single condition      = 119000
Double condition      = 0.00022
Long double condition = 0.000000108

So we see that it's expected that single-precision would perform poorly for an input of this size.

Another point to note is that although long double takes 16 bytes on my machine, this is only for alignment purposes: the true length that's used for compute is only 80-bits. Therefore, two long doubles will compare numerically equally but their unused bits are arbitrary. For true bitwise reproducibility the unused bits would need to be determined and set to a known value.

Code

//Compile with:
//g++ -g -O3 repro_vector.cpp -fopenmp

//NOTE: Always comile with `-g`. It doesn't slow down your code, but does make
//it debuggable and improves ease of profiling

#include <algorithm>
#include <bitset>               //Used for showing bitwise representations
#include <cfenv>                //Used for setting floating-point rounding modes
#include <chrono>               //Used for timing algorithms
#include <climits>              //Used for showing bitwise representations
#include <iostream>
#include <omp.h>                //OpenMP
#include <random>
#include <stdexcept>
#include <string>
#include <typeinfo>
#include <unordered_map>
#include <vector>

constexpr int ROUNDING_MODE = FE_UPWARD;
constexpr int N = 1'000'000;
constexpr int TESTS = 100;

// Simple timer class for tracking cumulative run time of the different
// algorithms
struct Timer {
  double total = 0;
  std::chrono::high_resolution_clock::time_point start_time;

  Timer() = default;

  void start() {
    start_time = std::chrono::high_resolution_clock::now();
  }

  void stop() {
    const auto now = std::chrono::high_resolution_clock::now();
    const auto time_span = std::chrono::duration_cast<std::chrono::duration<double>>(now - start_time);
    total += time_span.count();
  }
};

//Simple class to enable directed rounding in floating-point math and to reset
//the rounding mode afterwards, when it goes out of scope
struct SetRoundingMode {
  const int old_rounding_mode;

  SetRoundingMode(const int mode) : old_rounding_mode(fegetround()) {
    if(std::fesetround(mode)!=0){
      throw std::runtime_error("Failed to set directed rounding mode!");
    }
  }

  ~SetRoundingMode(){
    if(std::fesetround(old_rounding_mode)!=0){
      throw std::runtime_error("Failed to reset rounding mode to original value!");
    }
  }

  static std::string get_rounding_mode_string() {
    switch (fegetround()) {
      case FE_DOWNWARD:   return "downward";
      case FE_TONEAREST:  return "to-nearest";
      case FE_TOWARDZERO: return "toward-zero";
      case FE_UPWARD:     return "upward";
      default:            return "unknown";
    }
  }
};

// Used to make showing bitwise representations somewhat more intuitive
template<class T>
struct binrep {
  const T val;
  binrep(const T val0) : val(val0) {}
};

// Display the bitwise representation
template<class T>
std::ostream& operator<<(std::ostream& out, const binrep<T> a){
  const char* beg = reinterpret_cast<const char*>(&a.val);
  const char *const end = beg + sizeof(a.val);
  while(beg != end){
    out << std::bitset<CHAR_BIT>(*beg++);
    if(beg < end)
      out << ' ';
  }
  return out;
}



//Simple serial summation algorithm with an accumulation type we can specify
//to more fully explore its behaviour
template<class FloatType, class SimpleAccumType>
FloatType serial_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  for(const auto &x: vec){
    sum += x;
  }
  return sum;
}

//Parallel variant of the simple summation algorithm above
template<class FloatType, class SimpleAccumType>
FloatType parallel_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  #pragma omp parallel for default(none) reduction(+:sum) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    sum += vec[i];
  }
  return sum;
}



//Kahan's compensated summation algorithm for accurately calculating sums of
//many numbers with O(1) error
template<class FloatType>
FloatType serial_kahan_summation(const std::vector<FloatType> &vec){
  FloatType sum = 0.0f;
  FloatType c = 0.0f;
  for (const auto &num: vec) {
    const auto y = num - c;
    const auto t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}

//Parallel version of Kahan's compensated summation algorithm (could be improved
//by better accounting for the compsenation during the reduction phase)
template<class FloatType>
FloatType parallel_kahan_summation(const std::vector<FloatType> &vec){
  //Parallel phase
  std::vector<FloatType> sum(omp_get_max_threads(), 0);
  FloatType c = 0;
  #pragma omp parallel for default(none) firstprivate(c) shared(sum,vec)
  for (size_t i=0;i<vec.size();i++) {
    const auto tid = omp_get_thread_num();
    const auto y = vec[i] - c;
    const auto t = sum.at(tid) + y;
    c = (t - sum[tid]) - y;
    sum[tid] = t;
  }

  //Serial reduction phase

  //This could be more accurate if it took the remaining compensation values
  //from above into account
  FloatType total_sum = 0.0f;
  FloatType total_c = 0.0f;
  for(const auto &num: sum){
    const auto y = num - total_c;
    const auto t = total_sum + y;
    total_c = (t - total_sum) - y;
    total_sum = t;
  }

  return total_sum;
}



// Error-free vector transformation. Algorithm 4 from Demmel and Nguyen (2013)
template<class FloatType>
FloatType ExtractVectorNew2(
  const FloatType M,
  const typename std::vector<FloatType>::iterator &begin,
  const typename std::vector<FloatType>::iterator &end
){
  // Should use the directed rounding mode of the parent thread

  auto Mold = M;
  for(auto v=begin;v!=end;v++){
    auto Mnew = Mold + (*v);
    auto q = Mnew - Mold;
    (*v) -= q;
    Mold = Mnew;
  }

  //This is the exact sum of high order parts q_i
  //v is now the vector of low order parts r_i
  return Mold - M;
}

template<class FloatType>
FloatType mf_from_deltaf(const FloatType delta_f){
  const int power = std::ceil(std::log2(delta_f));
  return static_cast<FloatType>(3.0) * std::pow(2, power);
}

//Implements the error bound discussed near Equation 6 of
//Demmel and Nguyen (2013).
template<class FloatType>
bool is_error_bound_appropriate(const size_t N, const int k){
  const auto eps = std::numeric_limits<FloatType>::epsilon();
  const auto ratio = std::pow(N, k) * std::pow(eps, k-1);
  //If ratio << 1, then the conventional non-reproducible sum and the
  //deterministic sum will have error bounds of the same order. We arbitrarily
  //choose 1e-4 to represent this
  return ratio < 1e-3;
}

//Serial bitwise deterministic summation.
//Algorithm 8 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType serial_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  FloatType m = std::abs(vec.front());
  for(const auto &x: vec){
    m = std::max(m, std::abs(x));
  }

  FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
  FloatType Mf = mf_from_deltaf(delta_f);

  std::vector<FloatType> Tf(k);
  for(int f=0;f<k-1;f++){
    Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin(), vec.end());
    delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
    Mf = mf_from_deltaf(delta_f);
  }

  FloatType M = Mf;
  for(const FloatType &v: vec){
    M += v;
  }
  Tf[k-1] = M - Mf;

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}

//Parallel bitwise deterministic summation.
//Algorithm 9 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType parallel_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  std::vector<FloatType> Tf(k);

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  FloatType m = std::abs(vec.front());
  #pragma omp parallel for default(none) reduction(max:m) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    m = std::max(m, std::abs(vec[i]));
  }

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  #pragma omp declare reduction(vec_plus : std::vector<FloatType> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<FloatType>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

  #pragma omp parallel default(none) reduction(vec_plus:Tf) shared(k,m,n,vec,std::cout)
  {
    const auto adr = SetRoundingMode(ROUNDING_MODE);
    const auto threads = omp_get_num_threads();
    const auto tid = omp_get_thread_num();
    const auto values_per_thread = n / threads;
    const auto nlow = tid * values_per_thread;
    const auto nhigh = (tid<threads-1) ? ((tid+1) * values_per_thread) : n;

    FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
    FloatType Mf = mf_from_deltaf(delta_f);

    for(int f=0;f<k-1;f++){
      Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin() + nlow, vec.begin() + nhigh);
      delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
      Mf = mf_from_deltaf(delta_f);
    }

    FloatType M = Mf;
    for(size_t i=nlow;i<nhigh;i++){
      M += vec[i];
    }
    Tf[k-1] = M - Mf;
  }

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}



//Convenience wrappers
template<bool Parallel, class FloatType>
FloatType bitwise_deterministic_summation(
  const std::vector<FloatType> &vec,  // Note that we're making a copy!
  const int k
){
  if(Parallel){
    return parallel_bitwise_deterministic_summation<FloatType>(vec, k);
  } else {
    return serial_bitwise_deterministic_summation<FloatType>(vec, k);
  }
}

template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType simple_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return parallel_simple_summation<FloatType, SimpleAccumType>(vec);
  } else {
    return serial_simple_summation<FloatType, SimpleAccumType>(vec);
  }
}

template<bool Parallel, class FloatType>
FloatType kahan_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return serial_kahan_summation<FloatType>(vec);
  } else {
    return parallel_kahan_summation<FloatType>(vec);
  }
}



// Timing tests for the summation algorithms
template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType PerformTestsOnData(
  const int TESTS,
  std::vector<FloatType> floats, //Make a copy so we use the same data for each test
  std::mt19937 gen               //Make a copy so we use the same data for each test
){
  Timer time_deterministic;
  Timer time_kahan;
  Timer time_simple;

  //Very precise output
  std::cout.precision(std::numeric_limits<FloatType>::max_digits10);
  std::cout<<std::fixed;

  std::cout<<"Parallel? "<<Parallel<<std::endl;
  if(Parallel){
    std::cout<<"Max threads = "<<omp_get_max_threads()<<std::endl;
  }
  std::cout<<"Floating type                        = "<<typeid(FloatType).name()<<std::endl;
  std::cout<<"Floating type epsilon                = "<<std::numeric_limits<FloatType>::epsilon()<<std::endl;
  std::cout<<"Simple summation accumulation type   = "<<typeid(SimpleAccumType).name()<<std::endl;
  std::cout<<"Number of tests                      = "<<TESTS<<std::endl;
  std::cout<<"Input sample = "<<std::endl;
  for(size_t i=0;i<10;i++){
    std::cout<<"\t"<<floats[i]<<std::endl;
  }

  //Get a reference value
  std::unordered_map<FloatType, uint32_t> simple_sums;
  std::unordered_map<FloatType, uint32_t> kahan_sums;
  const auto ref_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
  for(int test=0;test<TESTS;test++){
    std::shuffle(floats.begin(), floats.end(), gen);

    time_deterministic.start();
    const auto my_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
    time_deterministic.stop();
    if(ref_val!=my_val){
      std::cout<<"ERROR: UNEQUAL VALUES ON TEST #"<<test<<"!"<<std::endl;
      std::cout<<"Reference      = "<<ref_val                   <<std::endl;
      std::cout<<"Current        = "<<my_val                    <<std::endl;
      std::cout<<"Reference bits = "<<binrep<FloatType>(ref_val)<<std::endl;
      std::cout<<"Current   bits = "<<binrep<FloatType>(my_val) <<std::endl;
      throw std::runtime_error("Values were not equal!");
    }

    time_kahan.start();
    const auto kahan_sum = kahan_summation<Parallel, FloatType>(floats);
    kahan_sums[kahan_sum]++;
    time_kahan.stop();

    time_simple.start();
    const auto simple_sum = simple_summation<Parallel, FloatType, SimpleAccumType>(floats);
    simple_sums[simple_sum]++;
    time_simple.stop();
  }

  std::cout<<"Average deterministic summation time = "<<(time_deterministic.total/TESTS)<<std::endl;
  std::cout<<"Average simple summation time        = "<<(time_simple.total/TESTS)<<std::endl;
  std::cout<<"Average Kahan summation time         = "<<(time_kahan.total/TESTS)<<std::endl;
  std::cout<<"Ratio Deterministic to Simple        = "<<(time_deterministic.total/time_simple.total)<<std::endl;
  std::cout<<"Ratio Deterministic to Kahan         = "<<(time_deterministic.total/time_kahan.total)<<std::endl;

  std::cout<<"Reference value                      = "<<std::fixed<<ref_val<<std::endl;
  std::cout<<"Reference bits                       = "<<binrep<FloatType>(ref_val)<<std::endl;

  std::cout<<"Distinct Kahan values                = "<<kahan_sums.size()<<std::endl;
  std::cout<<"Distinct Simple values               = "<<simple_sums.size()<<std::endl;

  int count = 0;
  for(const auto &kv: kahan_sums){
    std::cout<<"\tKahan sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  count = 0;
  for(const auto &kv: simple_sums){
    std::cout<<"\tSimple sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  std::cout<<std::endl;

  return ref_val;
}

// Use this to make sure the tests are reproducible
template<class FloatType, class SimpleAccumType>
void PerformTests(
  const int TESTS,
  const std::vector<long double> &long_floats,
  std::mt19937 &gen
){
  std::vector<FloatType> floats(long_floats.begin(), long_floats.end());

  const auto serial_val = PerformTestsOnData<false, FloatType, SimpleAccumType>(TESTS, floats, gen);
  const auto parallel_val = PerformTestsOnData<true, FloatType, SimpleAccumType>(TESTS, floats, gen);

  //Note that the `long double` type may only use 12-16 bytes (to maintain
  //alignment), but only 80 bits, resulting in bitwise indeterminism in the last
  //few bits; however, the floating-point values themselves will be equal.
  std::cout<<"########################################"<<std::endl;
  std::cout<<"### Serial and Parallel values match for "
           <<typeid(FloatType).name()
           <<"? "
           <<(serial_val==parallel_val)
           <<std::endl;
  std::cout<<"########################################\n"<<std::endl;
}



int main(){
  std::random_device rd;
  // std::mt19937 gen(rd());   //Enable for randomness
  std::mt19937 gen(123456789); //Enable for reproducibility
  std::uniform_real_distribution<long double> distr(-1000, 1000);
  std::vector<long double> long_floats;
  for(int i=0;i<N;i++){
    long_floats.push_back(distr(gen));
  }

  PerformTests<double, double>(TESTS, long_floats, gen);
  PerformTests<long double, long double>(TESTS, long_floats, gen);
  PerformTests<float, float>(TESTS, long_floats, gen);
  PerformTests<float, double>(TESTS, long_floats, gen);

  return 0;
}

Edit for Eric Postpischil

Eric says:

A generator such as I described will produce numbers with largely the same quantum—all near multiples of the divisor. This may not include a wide variety of exponents in the samples, so it may not well model a distribution where the numbers are being rounded in the interior of the significand when being added. E.g., if we add many numbers of the form 123.45, they will add fine for a while, although rounding errors will grow as the partial sums accumulate. But if we add numbers of forms 12345, 123.45, and 1.2345, different errors occur earlier

Adding 1M values of 123.45 gives a single long double Kahan value of 123450000.000000002837623469532. The deterministic long double value differs from this by -0.00000000000727595761 while the deterministic double value differs by -0.00000001206353772047 while the deterministic float differs by -3.68% (as expected given its large epsilon).

Choosing 1M values randomly from the set {1.2345, 12.345, 123.45, 1234.5, 12345} gives two long double Kahan values: A=2749592287.563000000780448317528 (N=54) and B=2749592287.563000000547617673874 (N=46). The deterministic long double value matches (A) exactly; the deterministic double value differs from (A) by -0.00000020139850676247; the deterministic float value differs from (A) by -257% (again, expected).

like image 93
Richard Avatar answered Oct 12 '22 21:10

Richard


TL/DR: There is simple and systematic way, even though it is not an efficient one: Convert everything to fixed-point and get the exact value. Since only the last step of conversion back to float contains rounding, the whole process is independent of the order of summands.

The idea is taken from The Exact Dot Product as Basic Tool for Long Interval Arithmetic.

The observation is that all floating point numbers can also be represented as fixed point numbers of appropriate width. With fixed point numbers multiplication and addition can be represented exactly provided that a few extra integer bits are added to account for overflow.

For instance a 64 bit binary IEE754 float can be represented as 2131 bit fixed point number with 1056 bits before and 1075 bits after the decimal point (see Table 1 in link for other FP formats, but note that the quoted numbers are factor 2 too high as they are for a product of two floats).

Adding 64 bits for overflow (assuming that a maximum of 2^64 numbers can be added at once) we arrive at at 2195 bit or 275 bytes fixed point width. In terms of 64 bit unsigned integers this is 18 before decimal point and 17 after decimal point.

For float32 it is only 3 before and 3 after decimal point, or, if one allows a decimal point in between one limbs, only 5 in total are needed.

Such arithmetic is straightforward to implement in many languages. Typically no dynamic mememory management is needed, and when unsigned integer overflow can be reliably detected (such as in C & C++) no assembly language is needed for efficient implementation.

The principle can be applied to more complex floating point problems where correct rounding is to be obtained. It is not very efficient but simple to implement and rather robubst (in the sense of testability).

EDIT: Corrected the numbers as they were factor of 2 too high. Also added float32 case (suggested from njuffa)

like image 3
Andreas H. Avatar answered Oct 12 '22 21:10

Andreas H.