I'm currently working as an author and contributor to SSIM.js and jest-image-snapshot on behalf of Not a Typical Agency. A lot of that work involves creating new algorithms to compare images in JavaScript. After a lot of research and testing with AssemblyScript and WebAssembly, I've found that I can often get higher-performing and more portable solutions with pure JS than with either of those two technologies. However, that only happens after extensively reviewing the generated assembly and running many experiments.
What I'd like to understand is whether there's a way to get Node.js/V8 to automatically vectorize sections of code. As an example, I have a two-pass loop that calculates prefix sums for each pixel in an image, first horizontally and then vertically. Skipping over the horizontal prefix sum (which can be challenging to vectorize for a real performance improvement even with hand-written assembly), the vertical prefix sum should be super simple to optimize. Here's an example:
for (let h = 0; h + 1 < height; ++h) {
  for (let w = 0; w < width; ++w) {
    // Add the pixel directly above (row h) into the pixel below it (row h + 1).
    const above = pSumArray[h * width + w];
    const current = pSumArray[(h + 1) * width + w];
    pSumArray[(h + 1) * width + w] = above + current;
  }
}
This takes the preexisting horizontal prefix sums and adds each row into the row below it, one row at a time, until it reaches the bottom of the image.
The assembler output looks something like this:
0x11afd3a4adb1 1d1 c4a17b1044c10f vmovsd xmm0,[rcx+r8*8+0xf]
0x11afd3a4adb8 1d8 4c8bc2 REX.W movq r8,rdx
0x11afd3a4adbb 1db 4503c6 addl r8,r14
0x11afd3a4adbe 1de 0f80ce020000 jo 0x11afd3a4b092 <+0x4b2>
0x11afd3a4adc4 1e4 443bc7 cmpl r8,rdi
0x11afd3a4adc7 1e7 0f83d1020000 jnc 0x11afd3a4b09e <+0x4be>
0x11afd3a4adcd 1ed c4a17b104cc10f vmovsd xmm1,[rcx+r8*8+0xf]
0x11afd3a4add4 1f4 c5fb58c1 vaddsd xmm0,xmm0,xmm1
0x11afd3a4add8 1f8 c4a17b1144c10f vmovsd [rcx+r8*8+0xf],xmm0
If you look closely, you can see it's performing all of the load, store, and add operations with 'sd' (scalar double-precision) instructions. This means it's only working with one double at a time. The documentation on this can be found here.
This is relevant in this application since it's a relative certainty that width is a multiple of 2, if not 16. As a result, we could be performing twice as many adds per instruction on a machine that only supports SSE2, and up to eight adds at a time on a machine that supports AVX-512. Even though Node.js and V8 seem to do a decent job of detecting the CPU at runtime, I still haven't found a way to get them to automatically vectorize a loop like this. I've tried a handful of strategies, including guarding the loop with specific conditionals (e.g. checking that width is divisible by 8) and combining them with manual loop unrolling (e.g. array[0] = above0 + current0; array[1] = above1 + current1; array[2] = above2 + current2; ...), but none of them has yielded any improvement.
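For reference, the unrolled variant I experimented with looked roughly like this (the divisible-by-4 guard, the stride, and the variable names are just one illustration of the idea):

  // Illustrative sketch of the guarded, manually unrolled vertical pass.
  // V8 still emitted scalar vaddsd instructions for this in my tests.
  if (width % 4 === 0) {
    for (let h = 0; h + 1 < height; ++h) {
      const above = h * width;        // start index of row h
      const below = (h + 1) * width;  // start index of row h + 1
      for (let w = 0; w < width; w += 4) {
        pSumArray[below + w]     += pSumArray[above + w];
        pSumArray[below + w + 1] += pSumArray[above + w + 1];
        pSumArray[below + w + 2] += pSumArray[above + w + 2];
        pSumArray[below + w + 3] += pSumArray[above + w + 3];
      }
    }
  }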
Any help someone can provide me on this topic would be greatly appreciated.
Thanks so much,
Dan
V8 doesn't do any auto-vectorization currently. (Source: I'm a V8 developer.) This may or may not change in the future; people are toying with ideas every now and then, but I'm not aware of any specific plans.
Wasm-SIMD is getting close to general availability (it's currently in "origin trial" experimental limited release phase), which will allow you to use SIMD instructions via WebAssembly. (Wasm being much lower level than JavaScript means that it generally gives you better control over what instruction sequence will be generated. The Wasm-SIMD instruction set in particular has been chosen such that it maps pretty well onto common hardware instructions.)
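To give a feel for it, here is a rough sketch of the vertical pass from the question written with AssemblyScript's v128/f64x2 built-ins, which compile down to Wasm-SIMD instructions. The function name, pointer-based memory layout, and the assumption that width is even are all illustrative, and how you enable the SIMD feature depends on the compiler version you use:

  // Rough AssemblyScript sketch (illustrative, not drop-in code).
  // `pSum` is a byte offset into linear memory holding f64 values laid out
  // row-major; assumes `width` is even and SIMD is enabled at compile time.
  export function verticalPrefixSum(pSum: usize, width: i32, height: i32): void {
    for (let h = 0; h + 1 < height; ++h) {
      const above: usize = pSum + (<usize>(h * width) << 3);       // byte offset of row h
      const below: usize = pSum + (<usize>((h + 1) * width) << 3); // byte offset of row h + 1
      for (let w = 0; w < width; w += 2) {
        const off: usize = <usize>w << 3;              // 8 bytes per f64
        const a = v128.load(above + off);              // two doubles from row h
        const b = v128.load(below + off);              // two doubles from row h + 1
        v128.store(below + off, f64x2.add(a, b));      // packed add, written back
      }
    }
  }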