I'm not very experienced with Rust and I'm trying to diagnose a performance problem. Below there is a pretty fast Java code (runs in 7 seconds) and what I think should the the equivalent Rust code. However, the Rust code runs very slowly (yes, I compiled it with --release as well), and it also appears to overflow. Changing i32 to i64 just pushes the overflow later, but it still happens. I suspect there is some bug in what I wrote, but after staring at the problem for a long time, I decided to ask for help.
public class Blah {
static final int N = 100;
static final int K = 50;
public static void main(String[] args) {
//initialize S
int[] S = new int[N];
for (int n = 1; n <= N; n++) S[n-1] = n*n;
// compute maxsum and minsum
int maxsum = 0;
int minsum = 0;
for (int n = 0; n < K; n++) {
minsum += S[n];
maxsum += S[N-n-1];
}
// initialize x and y
int[][] x = new int[K+1][maxsum+1];
int[][] y = new int[K+1][maxsum+1];
y[0][0] = 1;
// bottom-up DP over n
for (int n = 1; n <= N; n++) {
x[0][0] = 1;
for (int k = 1; k <= K; k++) {
int e = S[n-1];
for (int s = 0; s < e; s++) x[k][s] = y[k][s];
for (int s = 0; s <= maxsum-e; s++) {
x[k][s+e] = y[k-1][s] + y[k][s+e];
}
}
int[][] t = x;
x = y;
y = t;
}
// sum of unique K-subset sums
int sum = 0;
for (int s = minsum; s <= maxsum; s++) {
if (y[K][s] == 1) sum += s;
}
System.out.println(sum);
}
}
extern crate ndarray;
use ndarray::prelude::*;
use std::mem;
fn main() {
let numbers: Vec<i32> = (1..101).map(|x| x * x).collect();
let deg: usize = 50;
let mut min_sum: usize = 0;
for i in 0..deg {
min_sum += numbers[i] as usize;
}
let mut max_sum: usize = 0;
for i in deg..numbers.len() {
max_sum += numbers[i] as usize;
}
// Make an array
let mut x = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
let mut y = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
y[(0, 0)] = 1;
for n in 1..numbers.len() + 1 {
x[(0, 0)] = 1;
println!("Completed step {} out of {}", n, numbers.len());
for k in 1..deg + 1 {
let e = numbers[n - 1] as usize;
for s in 0..e {
x[(k, s)] = y[(k, s)];
}
for s in 0..max_sum - e + 1 {
x[(k, s + e)] = y[(k - 1, s)] + y[(k, s + e)];
}
}
mem::swap(&mut x, &mut y);
}
let mut ans = 0;
for s in min_sum..max_sum + 1 {
if y[(deg, s)] == 1 {
ans += s;
}
}
println!("{}", ans);
}
To diagnose a performance issue in general, I:
The last step is the hard part.
In your case, you have a separate implementation that you can use as your baseline. Comparing the two implementations, we can see that your data structures differ. In Java, you are building nested arrays, but in Rust you are using the ndarray crate. I know that crate has a good maintainer, but I personally don't know anything about the internals of it, or what use cases it best fits.
So I rewrote it with using the standard-library Vec.
The other thing I know is that direct array access isn't as fast as using an iterator. This is because array access needs to perform a bounds check, while iterators bake the bounds check into themselves. Many times this means using methods on Iterator.
The other change is to perform bulk data transfer when you can. Instead of copying element-by-element, move whole slices around, using methods like copy_from_slice.
With those changes the code looks like this (apologies for poor variable names, I'm sure you can come up with semantic names for them):
use std::mem;
const N: usize = 100;
const DEGREE: usize = 50;
fn main() {
let numbers: Vec<_> = (1..N+1).map(|v| v*v).collect();
let min_sum = numbers[..DEGREE].iter().fold(0, |a, &v| a + v as usize);
let max_sum = numbers[DEGREE..].iter().fold(0, |a, &v| a + v as usize);
// different data types for x and y!
let mut x = vec![vec![0; max_sum+1]; DEGREE+1];
let mut y = vec![vec![0; max_sum+1]; DEGREE+1];
y[0][0] = 1;
for &e in &numbers {
let e2 = max_sum - e + 1;
let e3 = e + e2;
x[0][0] = 1;
for k in 0..DEGREE {
let current_x = &mut x[k+1];
let prev_y = &y[k];
let current_y = &y[k+1];
// bulk copy
current_x[0..e].copy_from_slice(¤t_y[0..e]);
// more bulk copy
current_x[e..e3].copy_from_slice(&prev_y[0..e2]);
// avoid array index
for (x, y) in current_x[e..e3].iter_mut().zip(¤t_y[e..e3]) {
*x += *y;
}
}
mem::swap(&mut x, &mut y);
}
let sum = y[DEGREE][min_sum..max_sum+1].iter().enumerate().filter(|&(_, &v)| v == 1).fold(0, |a, (i, _)| a + i + min_sum);
println!("{}", sum);
println!("{}", sum == 115039000);
}
On OS X 10.11.5 with a 2.3 GHz Intel Core i7.
I'm not experienced enough with Java to know what kinds of optimizations it can do automatically.
The biggest potential next step I see is to leverage SIMD instructions when performing the addition; it's pretty much exactly what SIMD is made for.
As pointed out by Eli Friedman, avoiding array indexing by zipping isn't currently the most performant way of doing this.
With the changes below, the time is now 1.267s.
let xx = &mut current_x[e..e3];
xx.copy_from_slice(&prev_y[0..e2]);
let yy = ¤t_y[e..e3];
for i in 0..(e3-e) {
xx[i] += yy[i];
}
This generates assembly that appears to unroll the loop as well as using SIMD instructions:
+0x9b0 movdqu -48(%rsi), %xmm0
+0x9b5 movdqu -48(%rcx), %xmm1
+0x9ba paddd %xmm0, %xmm1
+0x9be movdqu %xmm1, -48(%rsi)
+0x9c3 movdqu -32(%rsi), %xmm0
+0x9c8 movdqu -32(%rcx), %xmm1
+0x9cd paddd %xmm0, %xmm1
+0x9d1 movdqu %xmm1, -32(%rsi)
+0x9d6 movdqu -16(%rsi), %xmm0
+0x9db movdqu -16(%rcx), %xmm1
+0x9e0 paddd %xmm0, %xmm1
+0x9e4 movdqu %xmm1, -16(%rsi)
+0x9e9 movdqu (%rsi), %xmm0
+0x9ed movdqu (%rcx), %xmm1
+0x9f1 paddd %xmm0, %xmm1
+0x9f5 movdqu %xmm1, (%rsi)
+0x9f9 addq $64, %rcx
+0x9fd addq $64, %rsi
+0xa01 addq $-16, %rdx
+0xa05 jne "slow::main+0x9b0"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With