Quickest way (performance-wise) to turn a string into a byte[] array in C# using the ASCII character encoding

What's the fastest way to turn a string into a byte[] array in C#? I'm sending tonnes of string data through sockets and need to optimize every single operation. Currently I transform the strings into byte[] arrays before sending, using:

private static readonly Encoding encoding = new ASCIIEncoding();
//...
byte[] bytes = encoding.GetBytes(someString);
socket.Send(bytes);
//...
asked Aug 26 '09 by Nosrama



3 Answers

If all your data is really going to be ASCII, then you may be able to do it slightly faster than ASCIIEncoding, which has various (entirely reasonable) bits of error handling etc. You may also be able to speed it up by avoiding creating new byte arrays all the time. Assuming you have an upper bound which all your messages will be under:

void QuickAndDirtyAsciiEncode(string chars, byte[] buffer)
{
    int length = chars.Length;
    for (int i = 0; i < length; i++)
    {
        // mask to 7 bits: out-of-range chars are silently mangled, not validated
        buffer[i] = (byte) (chars[i] & 0x7f);
    }
}

You'd then do something like:

readonly byte[] Buffer = new byte[8192]; // Reuse this repeatedly
...
QuickAndDirtyAsciiEncode(text, Buffer);
// We know ASCII takes one byte per character
socket.Send(Buffer, text.Length, SocketFlags.None);

This is pretty desperate optimisation though. I'd stick with ASCIIEncoding until I'd proven that this was the bottleneck (or at least that this sort of grotty hack doesn't help).
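If you do want to measure it, a rough Stopwatch comparison is usually enough to show whether the encoding step even registers next to the socket I/O. A minimal sketch (someString stands in for your real data; this is illustrative, not from the original post):

using System.Diagnostics;
using System.Text;
//...
var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    byte[] bytes = Encoding.ASCII.GetBytes(someString); // the candidate hot path
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds + " ms for 1,000,000 conversions");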

answered by Jon Skeet


I would say that how you are doing it now is plenty good. If you are really concerned with very low-level optimization like that, the best recommendation I can make is to get .NET Reflector. With Reflector, you can look at the framework code yourself (most of the time) and see what the algorithms are. If Reflector does not show you the code, you can always download Microsoft's SSCLI (Shared Source Common Language Infrastructure) to see the C++ code behind MethodImplOptions.InternalCall methods.

For reference, here is the actual implementation of Encoding.ASCII.GetBytes:

public override int GetBytes(string chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
{
    if ((chars == null) || (bytes == null))
    {
        throw new ArgumentNullException();
    }
    if ((charIndex < 0) || (charCount < 0))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((chars.Length - charIndex) < charCount)
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((byteIndex < 0) || (byteIndex > bytes.Length))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((bytes.Length - byteIndex) < charCount)
    {
        throw new ArgumentException();
    }
    int num = charIndex + charCount;
    while (charIndex < num)
    {
        char ch = chars[charIndex++];
        if (ch >= '\x0080')
        {
            ch = '?';   // any char outside 7-bit ASCII becomes '?'
        }
        bytes[byteIndex++] = (byte) ch;
    }
    return charCount;
}
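Note the behavior visible in that loop: any character at or above '\x0080' is replaced with '?' before the cast to byte. A quick illustration of the effect:

// 'é' (U+00E9) is outside 7-bit ASCII, so the encoder substitutes '?' (0x3F)
byte[] bytes = Encoding.ASCII.GetBytes("héllo");
Console.WriteLine(Encoding.ASCII.GetString(bytes));  // prints "h?llo"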
answered by jrista


The performance characteristic of implementing a general-purpose memcpy library function using a SIMD register is significantly more colorful than an equivalent implementation using a general-purpose register...

  - Intel 64 and IA-32 Architectures Optimization Reference Manual (April 2018) §3.7.6.1

For screaming speed when converting medium- to larger-sized chunks of data between 8-bit byte[] and "wide" (16-bit, Unicode) text, you'll want to consider solutions which deploy the SIMD instructions PUNPCKLBW+PUNPCKHBW (widening) and PACKUSWB (narrowing). In .NET, these are available as x64 JIT intrinsics, emitted for the hardware-accelerated System.Numerics types Vector and Vector<T> (see here for more info). The generic version Vector<T> is defined in the System.Numerics.Vectors package, which currently remains under fairly active development. As illustrated below, you'll also probably want to include the System.Runtime.CompilerServices.Unsafe package, since this is the SIMD load/store technique preferred by the Vector<T> authors.

The relevant SIMD acceleration is only enabled for capable CPUs in x64 mode, but otherwise .NET provides transparent fallback to emulation code in the System.Numerics.Vectors library, so the code demonstrated here does indeed reliably function across the wider .NET ecosystem, possibly with reduced performance. To test the code shown below, I used a console app on the full .NET Framework 4.8 ("desktop") in x64 (SIMD) and x86 (emulated) modes.
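You can confirm at runtime which mode you are getting:

// true when the JIT emits real SIMD instructions; false under emulation
Console.WriteLine(Vector.IsHardwareAccelerated);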

Since I wouldn't want to deprive anyone of the opportunity to learn the relevant techniques, I'll use Vector.Widen to illustrate the byte[] to char[] direction in C# 7. From this example, doing the reverse--i.e., using Vector.Narrow to implement the narrowing direction--is straightforward and is left as an exercise for the reader (though a sketch appears at the end of this post).

warning:
The methods suggested here are entirely encoding-unaware; they simply narrow or widen raw bytes with no regard for character mapping, text encoding, or other linguistic properties. When widening, the surplus (high) bytes are set to zero; when narrowing, the excess (high) bytes are discarded.

Others have discussed the numerous hazards associated with this practice on this page and elsewhere, so please carefully review and understand the nature of this operation before considering whether it is appropriate for your situation. For clarity, inline validation is elided from the code example shown below, but it could be added to the innermost loop with minimal impact on the SIMD benefit.

You have been warned. Although not SIMD-accelerated, the canonical techniques using a suitable Encoding instance are recommended for nearly all realistic app scenarios. Although the OP specifically requests a maximum-performance solution, I'll first summarize the proper, sanctioned techniques that should normally be used instead.

To widen a byte array to a .NET String, invoke the GetString() method on a suitable byte-oriented encoding instance:

String Encoding.ASCII.GetString(byte[] bytes)

To narrow a .NET String to an (e.g., ASCII) byte array, invoke the GetBytes() method on a suitable byte-oriented encoding instance:

byte[] Encoding.ASCII.GetBytes(string s)
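In practice, the canonical round trip looks like this:

byte[] bytes = Encoding.ASCII.GetBytes("Hello, world!");   // narrow: 13 bytes, one per char
string text  = Encoding.ASCII.GetString(bytes);            // widen: back to "Hello, world!"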

Ok, now on to the fun part--the extremely fast SIMD-enabled ("vectorized") C# code for "dumb" widening of a byte array. As a reminder, here are some dependencies that should be referenced:

// ... 
using System.Numerics;                  // nuget: System.Numerics.Vectors
using System.Runtime.CompilerServices;  // nuget: System.Runtime.CompilerServices.Unsafe
// ... 

Here is the public entry point wrapper function. If you prefer a version that returns char[] instead of String, it's provided at the end of this post.

/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding.
/// </summary>
public static unsafe String ByteArrayToString(byte[] bytes)
{
    // note: possible zeroing penalty; consider buffer pooling or 
    // other ways to allocate target?
    var s = new String('\0', bytes.Length);

    if (s.Length > 0)
        fixed (char* dst = s)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src, s.Length);
    return s;
}

Next is the main working loop body. Notice the prologue loop that aligns the destination to a 16-byte memory boundary, if necessary, by bytewise copying of up to 15 source bytes. This ensures the most efficient operation of the main "quad-quadwise" loop which, with a single pairing of SIMD PUNPCKLBW/PUNPCKHBW instructions, writes 32-bytes at once (16 source bytes are fetched and then stored as 16 wide chars occupying 32 bytes). Pre-aligning, plus the choice of dst alignment (as opposed to src) are official recommendations from the Intel manual cited above. Likewise, aligned operation entails that when the main loop completes, the source may have up to 15 residual trailing bytes; these are finished up by a short epilogue loop.

static unsafe void widen_bytes_simd(char* dst, byte* src, int c)
{
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (char)*src++;

    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Vector.Widen(Unsafe.AsRef<Vector<byte>>(src),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 0),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 8));

    for (c += 0x10; c > 0; c--)
        *dst++ = (char)*src++;
}
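A quick sanity check of the entry point (the test string here is mine, not from the benchmark):

byte[] raw = Encoding.ASCII.GetBytes("hello, simd");
string s = ByteArrayToString(raw);
Debug.Assert(s == "hello, simd");   // pure-ASCII input round-trips exactly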

That's actually all there is to it! It works like a charm and, as you'll see below, it does 'scream' as-advertised.

But first, by turning off the VS2017 debugger option "Suppress JIT optimization on module load," we can examine the native SIMD instruction stream that the x64 JIT generates for the 'release' build on .NET 4.7.2. Here is the relevant part of the main inner loop, which blasts through the data 32 bytes at a time. Notice that the JIT has managed to emit the theoretically minimal fetch/store pattern.

L_4223  mov         rax,rbx  
L_4226  movups      xmm0,xmmword ptr [rax] ; fetch 16 bytes
L_4229  mov         rax,rdi  
L_422C  lea         rdx,[rdi+10h]  
L_4230  movaps      xmm2,xmm0  
L_4233  pxor        xmm1,xmm1  
L_4237  punpcklbw   xmm2,xmm1               ; interleave 8-to-16 bits (lo)
L_423B  movups      xmmword ptr [rax],xmm2  ; store 8 bytes (lo) to 8 wide chars (16 bytes)
L_423E  pxor        xmm1,xmm1  
L_4242  punpckhbw   xmm0,xmm1               ; interleave 8-to-16 bits (hi)
L_4246  movups      xmmword ptr [rdx],xmm0  ; store 8 bytes (hi) to 8 wide chars (16 bytes)
L_4249  add         rbx,10h  
L_424D  add         rdi,20h  
L_4251  add         esi,0FFFFFFF0h  
L_4254  test        esi,esi  
L_4256  jge         L_4223  
L_4258  ...

Performance test results:
I tested the SIMD code against four other techniques that perform the same function. For the .NET encoders listed below, this was a call to the GetChars(byte[], int, int) method.

  • naive C# implementation of an unsafe bytewise loop
  • .NET encoding for the "Windows-1252" codepage
  • .NET encoding for ASCII
  • .NET encoding for UTF-8 (no BOM, no throwing)
  • SIMD code shown in this article

The testing included identical work for all units and validation that all units under test produced identical results. Test bytes were random and ASCII-only ([0x01 - 0x7F]) in order to guarantee that identical-results condition. Input size was random, maximum 1MB, with a log2 bias towards smaller sizes such that the average size was about 80K.

For fairness, execution order was systematically rotated through the five units on each iteration. For warmup, timings were discarded and reset to zero once, at iteration 100. The test harness performs no allocations during the test phase, and a full GC is forced and awaited every 10,000 iterations.
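One way to get that log2 bias--this is a guess at the harness, not code from it--is to pick a bit width uniformly and then a size within it:

// hypothetical sketch: random test sizes up to 1MB, log2-biased toward small
static int NextBiasedSize(Random rnd)
{
    int bits = rnd.Next(1, 21);        // choose a magnitude, 2^1 .. 2^20
    return rnd.Next(0, 1 << bits);     // then a uniform size within it
}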

                 Relative ticks, normalized to best result
                  .NET Framework 4.7.3056.0 x64 (release)
   iter      naive      win-1252         ascii         utf-8          simd
-------   -----------  ------------  ------------  ------------   -----------
  10000 |    131.5         294.5         186.2         145.6         100.0
  20000 |    137.7         305.3         191.9         149.4         100.0
  30000 |    139.2         308.5         195.8         151.5         100.0
  40000 |    141.8         312.1         198.5         153.2         100.0
  50000 |    142.0         313.8         199.1         154.1         100.0
  60000 |    140.5         310.6         196.7         153.0         100.0
  70000 |    141.1         312.9         197.3         153.6         100.0
  80000 |    141.6         313.7         197.8         154.1         100.0
  90000 |    141.3         313.7         197.9         154.3         100.0
 100000 |    141.1         313.3         196.9         153.7         100.0
 
gcServer=False; LatencyMode.Interactive; Vector.IsHardwareAccelerated=True 

On the preferred x64 platform, with JIT optimization enabled and SIMD available, there was no contest: per the table, the SIMD code runs about 40% faster than the next contender (the naive bytewise loop). Encoding.Default, which is usually the "Windows-1252" codepage, performed particularly poorly, about 3x slower than the SIMD code.

Earlier I mentioned that the distribution of test data sizes was strongly log-biased towards zero. Without that bias--that is, with a uniform distribution of sizes from 0 to 1,048,576 bytes (average test size 512K)--SIMD continues to outpace the pack, and all of the other units fare relatively worse than in the table above:

naive       153.45%
win-1252    358.84%
ascii       221.38%
utf-8       161.62%
simd        100.00%

As for the non-SIMD (emulation) case, UTF-8 and SIMD are extremely close--typically within 3-4% of each other--and far better than the rest. I found this result doubly surprising: first that the UTF8Encoding source code is so fast (there's a lot of fast-path optimization in there), and second that the general-purpose SIMD emulation code was able to match that purpose-tuned code.


Addendum:
In the code above, I mentioned a possible O(n) performance penalty (associated with excess re-zeroing) from using the new String(char, int) constructor to allocate the target string. For completeness, here is an alternate entry point which avoids the problem by instead returning the widened data as a char[]:

/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe char[] WidenByteArray(byte[] bytes)
{
    var rgch = new char[bytes.Length];
    if (rgch.Length > 0)
        fixed (char* dst = rgch)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src, rgch.Length);
    return rgch;
}
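And finally, as promised, a minimal sketch of the narrowing direction (char* to byte*), mirroring the alignment strategy of widen_bytes_simd. This is my own untested adaptation rather than part of the benchmarked code; as with everything above, chars beyond '\x00FF' are not validated, so add a range check to the loops if your input isn't guaranteed narrow:

static unsafe void narrow_chars_simd(byte* dst, char* src, int c)
{
    // prologue: bytewise until the destination is 16-byte aligned
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (byte)*src++;

    // main loop: pack 16 wide chars (32 bytes) down to 16 bytes per iteration
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Unsafe.AsRef<Vector<byte>>(dst) =
            Vector.Narrow(Unsafe.AsRef<Vector<ushort>>(src + 0),
                          Unsafe.AsRef<Vector<ushort>>(src + 8));

    // epilogue: finish any residual trailing chars
    for (c += 0x10; c > 0; c--)
        *dst++ = (byte)*src++;
}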
answered by Glenn Slayden