Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

An SSE Stdlib-esque Library?

Generally everything I come across 'on-the-net' with relation to SSE/MMX comes out as maths stuff for vectors and matracies. However, I'm looking for libraries of SSE optimized 'standard functions', like those provided by Agner Fog, or some of the SSE based string scanning algorithms in GCC.

As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp BSR/BSF, ie an stdlib-esque built from SSE intrsuctions

I'd preferably like them to be for SSE1 (formally MMX2) using intrinsics rather than assembly, but either is fine. hopefully this not too broad a spectrum.

Update 1

I came across some promising stuff after some searching, one library caught my eye:

  • LibFreeVec: seems mac/IBM only (due to being AltiVec based), thus of little use(to me), plus I can't seem to find a direct download link, nor does it state the minimum supported SSE version

I also came across an article on a few vectorised string functions(strlen, strstr strcmp). However SSE4.2 is way out of my reach (as said before, I'd like to stick to SSE1/MMX).

Update 2

Paul R motivated me to do a little benchmarking, unfortunately as my SSE assembly coding experience is close to zip, I used someone else's (http://www.mindcontrol.org/~hplus/) benchmarking code and added to it. All tests(excluding the original, which is VC6 SP5) where compiled under VC9 SP1 with full/customized optimizations and /arch:SSE on.

First test was one my home machine (AMD Sempron 2200+ 512mb DDR 333), capped at SSE1 (thus no vectorization by MSVC memcpy):

comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 2879        506.75 MB/s     4132        353.08 MB/s     2655        549.51 MB/s
   2 kB 4877        598.29 MB/s     7041        414.41 MB/s     5179        563.41 MB/s
   4 kB 8890        656.44 MB/s     13123       444.70 MB/s     9832        593.55 MB/s
   8 kB 17413       670.28 MB/s     25128       464.48 MB/s     19403       601.53 MB/s
  16 kB 34569       675.26 MB/s     48227       484.02 MB/s     38303       609.43 MB/s
  32 kB 68992       676.69 MB/s     95582       488.44 MB/s     75969       614.54 MB/s
  64 kB 138637      673.50 MB/s     195012      478.80 MB/s     151716      615.44 MB/s
 128 kB 277678      672.52 MB/s     400484      466.30 MB/s     304670      612.94 MB/s
 256 kB 565227      660.78 MB/s     906572      411.98 MB/s     618394      603.97 MB/s
 512 kB 1142478     653.82 MB/s     1936657     385.70 MB/s     1380146     541.23 MB/s
1024 kB 2268244     658.64 MB/s     3989323     374.49 MB/s     2917758     512.02 MB/s
2048 kB 4556890     655.69 MB/s     8299992     359.99 MB/s     6166871     484.51 MB/s
4096 kB 9307132     642.07 MB/s     16873183        354.16 MB/s     12531689    476.86 MB/s

full tests

Second test batch was done on a university workstation(Intel E6550, 2.33Ghz, 2gb DDR2 800?)

VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 392         5797.69 MB/s    434         5236.63 MB/s    420         5411.18 MB/s
   2 kB 882         5153.51 MB/s    707         6429.13 MB/s    714         6366.10 MB/s
   4 kB 2044        4447.55 MB/s    1218        7463.70 MB/s    1218        7463.70 MB/s
   8 kB 3941        4613.44 MB/s    2170        8378.60 MB/s    2303        7894.73 MB/s
  16 kB 7791        4667.33 MB/s    4130        8804.63 MB/s    4410        8245.61 MB/s
  32 kB 15470       4701.12 MB/s    7959        9137.61 MB/s    8708        8351.66 MB/s
  64 kB 30716       4735.40 MB/s    15638       9301.22 MB/s    17458       8331.57 MB/s
 128 kB 61019       4767.45 MB/s    31136       9343.05 MB/s    35259       8250.52 MB/s
 256 kB 122164      4762.53 MB/s    62307       9337.80 MB/s    72688       8004.21 MB/s
 512 kB 246302      4724.36 MB/s    129577      8980.15 MB/s    142709      8153.80 MB/s
1024 kB 502572      4630.66 MB/s    332941      6989.95 MB/s    290528      8010.38 MB/s
2048 kB 1105076     4211.91 MB/s    1384908     3360.86 MB/s    662172      7029.11 MB/s
4096 kB 2815589     3306.22 MB/s    4342289     2143.79 MB/s    2172961     4284.00 MB/s

full tests

As can be seen, SSE is very fast on my home system, but falls on the intel machine (probably due to bad coding?). my x86 assembly variant comes in second on my home machine, and second on the intel system (but the results look a bit inconsistent, one hug blocks it dominates the SSE1 version). the MSVC memcpy wins the intel system tests hands done, this is due to SSE2 vectorization though, on my home machine, it fails dismally, even the horrible __movsd beats it...

pitfalls: the memory was all aligned powers of 2. cache was (hopefully) flushed. rdtsc was used for timing.

points of interest: MSVC has an (unlisted in any reference) __movsd intrinsic, it outputs the same assembly code I'm using, but it fails dismally(even when inlined!). That's probably why its unlisted.

VC9 memcpy can be forced to vectorize on my non-sse 2 machine, it will however corrupt the FPU stack, it also seems to have a bug.

This is the full source to what I used to test (including my alterations, again, credit to http://www.mindcontrol.org/~hplus/ for the original). The binaries an project files are available on request.

In conclusion, it seems a switching variant might be the best, similar to the MSVC crt one, only a lot more sturdy with more options and single once-off checks (via inline'd function pointers? or something more devious like internal direct call patch), however inlining would probably have to use a best case method instead

Update 3

A question asked by Eshan reminded of something useful and related to this, although only for bit sets and bit ops, BitMagic and be quite useful for large bit sets, it even has a nice article on SSE2 (bit) optimization. Unfortunatly, this still isn't a CRT/stdlib esque type library. its seems most of these projects are dedicated to a specific, small section (of problems).

This raises the question, would it then rather be worth will to create a open-source, probably multi-platform performance crt/stdlib project, creating various versions of the standardised functions, each optimized for certain situation as well as a 'best-case'/general use variant of the function, with either runtime branching for scalar/MMX/SSE/SSE2+ (à la MSVC) or a forced compile time scalar/SIMD swich.

This could be useful for HPC, or projects where every bit of performance counts (like games), freeing the programmer from worrying about the speed of the inbuilt functions, only requiring a small bit of tuning to find the optimal optimized variant.

Update 4

I think the nature of this question should be expanded, to include techniques that can be applied using SSE/MMX to optimization for non-vector/matrix applications, this could probably be used for 32/64bit scalar code as well. A good example is how to check for the occurence of a byte in a given 32/64/128/256 bit data type, at once using scalar techniques(bit manip), MMX & SSE/SIMD

Also, I see a lot of answers along the lines of "just use ICC", and thats a good answer, it not my kinda of answer, as firstly, ICC is not something I can use continuously (unless Intel have a free student version for windows), due to the 30 trial. secondly, and more pertinently, I'm not only after the libraries its self, but the techniques used to optimize/create the functions they contain, for my personal eddification and improvement, and so I can apply such techniques and principles to my own code (where needed), in conjunction with the use of these libraries. hopefully that clears up that part :)

like image 897
Necrolis Avatar asked Oct 05 '10 10:10

Necrolis


1 Answers

Here's an article on how to use SIMD instructions to vectorize the counting of characters:

http://porg.es/blog/ridiculous-utf-8-character-counting

like image 175
carlo Avatar answered Sep 29 '22 08:09

carlo