Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast little-endian to big-endian conversion in ASM

I have an array of uint-types in C#, After checking if the program is working on a little-endian machine, I want to convert the data to a big-endian type. Because the amount of data can become very large but is always even, I was thinking to consider two uint types as an ulong type, for a better performance and program it in ASM, so I am searching for a very fast (the fastest if possible) Assembler-algorithm to convert little-endian in big-endian.

like image 832
Willem Van Onsem Avatar asked Aug 31 '09 18:08

Willem Van Onsem


3 Answers

For a large amount of data, the bswap instruction (available in Visual C++ under the _byteswap_ushort, _byteswap_ulong, and _byteswap_uint64 intrinsics) is the way to go. This will even outperform handwritten assembly. These are not available in pure C# without P/Invoke, so:

  1. Only use this if you have a lot of data to byte swap.
  2. You should seriously consider writing your lowest level application I/O in managed C++ so you can do your swapping before ever bringing the data into a managed array. You already have to write a C++ library, so there's not much to lose and you sidestep all the P/Invoke-related performance issues for low-complexity algorithms operating on large datasets.

PS: Many people are unaware of the byte swap intrinsics. Their performance is astonishing, doubly so for floating point data because it processes them as integers. There is no way to beat it without hand coding your register loads for every single byte swap use case, and should you try that, you'll probably incur a bigger hit in the optimizer than you'll ever pick up.

like image 96
Sam Harwell Avatar answered Sep 27 '22 18:09

Sam Harwell


You may want to simply rethink the problem, this should not be a bottleneck. Take the naive algorithm (written in CLI assembly, just for fun). lets assume the number we want is in local number 0

LDLOC 0
SHL 24
LDLOC 0
LDC.i4 0x0000ff00
SHL 8
OR
LDLOC 0
LDC.i4 0x00ff0000
SHL.UN 8
OR
LDLOC 0
SHL.UN 24
OR

At most that's 13 (x86) assembly instructions per number (and most likely the interpreter will be even smarter by using clever registers). And it doesn't get more naive than that.

Now, compare that to the costs of

  • Getting the data loaded in (including whatever peripherals you are working with!)
  • Maniuplation of the data (doing comparisons, for instance)
  • Outputting the result (whatever it is)

If 13 instructions per number is a significant chunk of your execution time, then you are doing a VERY high performance task and should have your input in the correct format! You also probably would not be using a managed language because you would want far more control over buffers of data and what-not, and no extra array bounds checking.

If that array of data comes across a network, I would expect there to be much greater costs from the managing of sockets than from a mere byte order flip, if its from disk, consider pre-flipping before executing this program.

like image 23
RH. Avatar answered Sep 27 '22 17:09

RH.


I was thinking to consider two uint types as an ulong type

Well, that would also swap the two uint values, which might not be desirable...

You could try some C# code in unsafe mode, which may actually perform well enough. Like:

public static unsafe void SwapInts(uint[] data) {
   int cnt = data.Length;
   fixed (uint* d = data) {
      byte* p = (byte*)d;
      while (cnt-- > 0) {
         byte a = *p;
         p++;
         byte b = *p;
         *p = *(p + 1);
         p++;
         *p = b;
         p++;
         *(p - 3) = *p;
         *p = a;
         p++;
      }
   }
}

On my computer the throughput is around 2 GB per second.

like image 41
Guffa Avatar answered Sep 27 '22 17:09

Guffa