Diagnosing a performance issue

Question

I'm not very experienced with Rust and I'm trying to diagnose a performance problem. Below there is a pretty fast Java code (runs in 7 seconds) and what I think should the the equivalent Rust code. However, the Rust code runs very slowly (yes, I compiled it with --release as well), and it also appears to overflow. Changing i32 to i64 just pushes the overflow later, but it still happens. I suspect there is some bug in what I wrote, but after staring at the problem for a long time, I decided to ask for help.

public class Blah {

    static final int N = 100;
    static final int K = 50;

    public static void main(String[] args) {
        //initialize S
        int[] S = new int[N];
        for (int n = 1; n <= N; n++) S[n-1] = n*n;

        // compute maxsum and minsum
        int maxsum = 0;
        int minsum = 0;
        for (int n = 0; n < K; n++) {
            minsum += S[n];
            maxsum += S[N-n-1];
        }

        // initialize x and y
        int[][] x = new int[K+1][maxsum+1];
        int[][] y = new int[K+1][maxsum+1];
        y[0][0] = 1;

        // bottom-up DP over n
        for (int n = 1; n <= N; n++) {
            x[0][0] = 1;
            for (int k = 1; k <= K; k++) {
                int e = S[n-1];
                for (int s = 0; s < e; s++) x[k][s] = y[k][s];
                for (int s = 0; s <= maxsum-e; s++) {
                    x[k][s+e] = y[k-1][s] + y[k][s+e];
                }
            }
            int[][] t = x;
            x = y;
            y = t;
        }

        // sum of unique K-subset sums
        int sum = 0;
        for (int s = minsum; s <= maxsum; s++) {
            if (y[K][s] == 1) sum += s;
        }
        System.out.println(sum);
    }

}

extern crate ndarray;

use ndarray::prelude::*;
use std::mem;

fn main() {
    let numbers: Vec<i32> = (1..101).map(|x| x * x).collect();

    let deg: usize = 50;

    let mut min_sum: usize = 0;
    for i in 0..deg {
        min_sum += numbers[i] as usize;
    }

    let mut max_sum: usize = 0;
    for i in deg..numbers.len() {
        max_sum += numbers[i] as usize;
    }

    // Make an array
    let mut x = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
    let mut y = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);

    y[(0, 0)] = 1;

    for n in 1..numbers.len() + 1 {
        x[(0, 0)] = 1;
        println!("Completed step {} out of {}", n, numbers.len());
        for k in 1..deg + 1 {
            let e = numbers[n - 1] as usize;
            for s in 0..e {
                x[(k, s)] = y[(k, s)];
            }
            for s in 0..max_sum - e + 1 {
                x[(k, s + e)] = y[(k - 1, s)] + y[(k, s + e)];
            }
        }
        mem::swap(&mut x, &mut y);
    }

    let mut ans = 0;

    for s in min_sum..max_sum + 1 {
        if y[(deg, s)] == 1 {
            ans += s;
        }
    }

    println!("{}", ans);

}

Shepmaster · Accepted Answer

To diagnose a performance issue in general, I:

Get a baseline time or rate. Preferably create a testcase that only takes a few seconds, as profilers tend to slow down the system a bit. You will also want to iterate frequently.
Compile in release mode with debugging symbols.
Run the code in a profiler. I'm on OS X so my main choice is Instruments, but I also use valgrind.
Find the hottest code path, think about why it's slow, try something, measure.

The last step is the hard part.

In your case, you have a separate implementation that you can use as your baseline. Comparing the two implementations, we can see that your data structures differ. In Java, you are building nested arrays, but in Rust you are using the ndarray crate. I know that crate has a good maintainer, but I personally don't know anything about the internals of it, or what use cases it best fits.

So I rewrote it with using the standard-library Vec.

The other thing I know is that direct array access isn't as fast as using an iterator. This is because array access needs to perform a bounds check, while iterators bake the bounds check into themselves. Many times this means using methods on Iterator.

The other change is to perform bulk data transfer when you can. Instead of copying element-by-element, move whole slices around, using methods like copy_from_slice.

With those changes the code looks like this (apologies for poor variable names, I'm sure you can come up with semantic names for them):

use std::mem;

const N: usize = 100;
const DEGREE: usize = 50;

fn main() {
    let numbers: Vec<_> = (1..N+1).map(|v| v*v).collect();

    let min_sum = numbers[..DEGREE].iter().fold(0, |a, &v| a + v as usize);
    let max_sum = numbers[DEGREE..].iter().fold(0, |a, &v| a + v as usize);

    // different data types for x and y!
    let mut x = vec![vec![0; max_sum+1]; DEGREE+1];
    let mut y = vec![vec![0; max_sum+1]; DEGREE+1];
    y[0][0] = 1;

    for &e in &numbers {
        let e2 = max_sum - e + 1;
        let e3 = e + e2;

        x[0][0] = 1;

        for k in 0..DEGREE {
            let current_x = &mut x[k+1];
            let prev_y = &y[k];
            let current_y = &y[k+1];

            // bulk copy
            current_x[0..e].copy_from_slice(&current_y[0..e]);

            // more bulk copy
            current_x[e..e3].copy_from_slice(&prev_y[0..e2]);

            // avoid array index
            for (x, y) in current_x[e..e3].iter_mut().zip(&current_y[e..e3]) {
                *x += *y;
            }
        }

        mem::swap(&mut x, &mut y);
    }

    let sum = y[DEGREE][min_sum..max_sum+1].iter().enumerate().filter(|&(_, &v)| v == 1).fold(0, |a, (i, _)| a + i + min_sum);

    println!("{}", sum);
    println!("{}", sum == 115039000);
}

2.060s - Rust 1.9.0
2.225s - Java 1.7.0_45-b18

On OS X 10.11.5 with a 2.3 GHz Intel Core i7.

I'm not experienced enough with Java to know what kinds of optimizations it can do automatically.

The biggest potential next step I see is to leverage SIMD instructions when performing the addition; it's pretty much exactly what SIMD is made for.

As pointed out by Eli Friedman, avoiding array indexing by zipping isn't currently the most performant way of doing this.

With the changes below, the time is now 1.267s.

let xx = &mut current_x[e..e3];
xx.copy_from_slice(&prev_y[0..e2]);

let yy = &current_y[e..e3];
for i in 0..(e3-e) {
    xx[i] += yy[i];
}

This generates assembly that appears to unroll the loop as well as using SIMD instructions:

+0x9b0    movdqu    -48(%rsi), %xmm0
+0x9b5    movdqu    -48(%rcx), %xmm1
+0x9ba    paddd     %xmm0, %xmm1
+0x9be    movdqu    %xmm1, -48(%rsi)
+0x9c3    movdqu    -32(%rsi), %xmm0
+0x9c8    movdqu    -32(%rcx), %xmm1
+0x9cd    paddd     %xmm0, %xmm1
+0x9d1    movdqu    %xmm1, -32(%rsi)
+0x9d6    movdqu    -16(%rsi), %xmm0
+0x9db    movdqu    -16(%rcx), %xmm1
+0x9e0    paddd     %xmm0, %xmm1
+0x9e4    movdqu    %xmm1, -16(%rsi)
+0x9e9    movdqu    (%rsi), %xmm0
+0x9ed    movdqu    (%rcx), %xmm1
+0x9f1    paddd     %xmm0, %xmm1
+0x9f5    movdqu    %xmm1, (%rsi)
+0x9f9    addq      $64, %rcx
+0x9fd    addq      $64, %rsi
+0xa01    addq      $-16, %rdx
+0xa05    jne       "slow::main+0x9b0"

Diagnosing a performance issue

Tags:

java

performance

multidimensional-array

rust

Sidious Lord

1 Answers

Shepmaster

Recent Activity

Donate For Us

Diagnosing a performance issue

Tags:

java

performance

multidimensional-array

rust

Sidious Lord

1 Answers

Shepmaster

Related questions

Recent Activity

Donate For Us