This one puzzles me for 3 days. I have an application which needs to evaluate a certain set of integer polynomials (of multiple args) which have very few elements. I already have an implementation written in Java and I am currently porting to C++.
During testing, I noticed that the C++ version is orders of magnitudes slower that the Java variant. I know of course about JIT-ing and that this scenario is particulary well-posed for this kind of compilers, but what I see is way off from what I had expected.
The sample code is below, you'll need boost to compile the C++ code (but that dependency is only required for simple time measurement).
choeger@daishi ~/uebb % clang++ -O3 -std=c++11 polytest.cpp -lboost_timer -lboost_system
choeger@daishi ~/uebb % ./a.out
0.011694s wall, 0.010000s user + 0.000000s system = 0.010000s CPU (85.5%)
Ideal Result: 1e+07
0.421986s wall, 0.420000s user + 0.000000s system = 0.420000s CPU (99.5%)
Result: 1e+07
choeger@daishi ~/uebb % javac PolyTest.java
choeger@daishi ~/uebb % java PolyTest
evals: 10000000 runtime: 17ms
Ideal Result: 1.0E7
evals: 10000000 runtime: 78ms
Result: 1.0E7
Apparently, the C++ version (compiled with clang-3.3) runs slightly faster when it comes to pure computational power, but Java (openjdk 1.7.0.60) does much better when the polynomial is interpreted. My guess so far is, that my C++ code is not quite optimal due to the iteration over small (in the sample 1-element) vectors. I assume the JVM does much better here when it comes to cache-hits misses.
Is there any way to make my C++ version perform better? Is there a different cause I just did not see? And as a side note: is there way to measure cache-coherence for a C++ and a Java process?
The C++ code looks like this:
#include <boost/timer/timer.hpp>
#include <iostream>
#include <vector>
using namespace std;
struct Product {
int factor;
vector<int> fields;
};
class SumOfProducts {
public:
vector<Product> sum;
/**
* evaluate the polynomial with arguments separated by width
*/
inline double eval(const double* arg, const int width) const {
double res = 0.0;
for (Product p : sum) {
double prod = p.factor;
for (int f : p.fields) {
prod *= arg[f*width];
}
res += prod;
}
return res;
};
};
double idealBenchmark(const double* arg, const int width) {
boost::timer::auto_cpu_timer t;
double res = 0.0;
// run 10M evaluations
for (long l = 0; l < 10000000; l++) {
res = res + arg[width] * arg[width];
}
return res;
}
double benchmark(const double* arg, const SumOfProducts& poly) {
boost::timer::auto_cpu_timer t;
double res = 0.0;
// run 10M evaluations
for (long l = 0; l < 10000000; l++) {
res = res + poly.eval(arg, 1);
}
return res;
}
int main() {
//simple polynomial: x_1^2
Product p;
p.factor = 1;
p.fields.push_back(1);
p.fields.push_back(1);
SumOfProducts poly;
poly.sum.push_back(p);
double arg[] = { 0, 1 };
double res = idealBenchmark(arg, 1);
cout << "Ideal Result: " << res << endl;
res = benchmark(arg, poly);
cout << "Result: " << res << endl;
}
The Java version like this:
public class PolyTest {
static class Product {
public final int factor;
public final int[] fields;
public Product(int pFactor, int[] pFields) {
factor = pFactor;
fields = pFields;
}
}
static class SumOfProducts {
final Product[] sum;
public SumOfProducts(Product[] pSum) {
sum = pSum;
}
/**
* evaluate the polynomial with arguments separated by width
*/
double eval(final double[] arg, final int width) {
double res = 0.0;
for (Product p : sum) {
double prod = p.factor;
for (int f : p.fields) {
prod *= arg[f*width];
}
res += prod;
}
return res;
}
}
static double idealBenchmark(final double[] arg, final int width) {
final long start = System.currentTimeMillis();
double res = 0.0;
long evals = 0;
// run 10M evaluations
for (long l = 0; l < 10000000; l++) {
evals++;
res = res + arg[width] * arg[width];
}
System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
return res;
}
static double benchmark(final double[] arg, final SumOfProducts poly) {
final long start = System.currentTimeMillis();
double res = 0.0;
long evals = 0;
// run 10M evaluations
for (long l = 0; l < 10000000; l++) {
evals++;
res = res + poly.eval(arg, 1);
}
System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
return res;
}
public static void main(String[] args) {
//simple polynomial: x_1^2
Product p = new Product(1, new int[]{1, 1});
SumOfProducts poly = new SumOfProducts(new Product[]{p});
double arg[] = { 0, 1 };
double res = idealBenchmark(arg, 1);
System.out.println("Ideal Result: " + res);
res = benchmark(arg, poly);
System.out.println("Result: " + res);
}
}
C is a low-level, executable-compiled language, that can do low-level hardware operations such as dynamic memory allocation, and accessing hardware drivers at a very low level. C is normally faster than Java, if it is written efficiently, but it always depends on the compiler.
Java startup time is often much slower than many languages, including C, C++, Perl or Python, because many classes (and first of all classes from the platform Class libraries) must be loaded before being used.
Java vs. C++ performance. In contrast, a program written in C++ gets compiled directly into machine code -- without an intermediary translation required at runtime. This is one reason why C++ programs tend to perform faster than those written in Java.
Speed and performance Java is a favorite among developers, but because the code must first be interpreted during run-time, it's also slower. C++ is compiled to binaries, so it runs immediately and therefore faster than Java programs.
You are making expensive copies here:
for (Product p : sum)
Each copy means fully copying the std::vector<int>
data member of each element. Use references instead:
for (const Product& p : sum)
Note that I made them const
, because you do not need to change the elements of the range.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With