What's the fastest way to turn a string into a byte[] in C#? I'm sending tonnes of string data through sockets and need to optimize every single operation. Currently I transform the strings into byte[] arrays before sending, using:
private static readonly Encoding encoding = new ASCIIEncoding();
//...
byte[] bytes = encoding.GetBytes(someString);
socket.Send(bytes);
//...
If all your data is really going to be ASCII, then you may be able to do it slightly faster than ASCIIEncoding, which has various (entirely reasonable) bits of error handling etc. You may also be able to speed it up by avoiding creating new byte arrays all the time. Assuming you have an upper bound that all your messages will be under:
void QuickAndDirtyAsciiEncode(string chars, byte[] buffer)
{
    int length = chars.Length;
    for (int i = 0; i < length; i++)
    {
        buffer[i] = (byte) (chars[i] & 0x7f);
    }
}
You'd then do something like:
readonly byte[] Buffer = new byte[8192]; // Reuse this repeatedly
...
QuickAndDirtyAsciiEncode(text, Buffer);
// We know ASCII takes one byte per character
socket.Send(Buffer, text.Length, SocketFlags.None);
This is pretty desperate optimisation though. I'd stick with ASCIIEncoding until I'd proven that this was the bottleneck (or at least that this sort of grotty hack doesn't help).
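If you do want to prove it one way or the other, a rough Stopwatch comparison along these lines (the names and sizes here are purely illustrative) will usually settle it:

using System;
using System.Diagnostics;
using System.Text;

static class EncodeBench
{
    static readonly Encoding Ascii = new ASCIIEncoding();
    static readonly byte[] Buffer = new byte[8192];   // reused across iterations

    static void Main()
    {
        string text = new string('x', 4096);
        const int iterations = 100000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Ascii.GetBytes(text, 0, text.Length, Buffer, 0);
        sw.Stop();
        Console.WriteLine("ASCIIEncoding : " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            QuickAndDirtyAsciiEncode(text, Buffer);
        sw.Stop();
        Console.WriteLine("Quick & dirty : " + sw.ElapsedMilliseconds + " ms");
    }

    static void QuickAndDirtyAsciiEncode(string chars, byte[] buffer)
    {
        for (int i = 0; i < chars.Length; i++)
            buffer[i] = (byte)(chars[i] & 0x7f);
    }
}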
I would say that how you are doing it now is plenty good. If you are really concerned with very low-level optimization like that, the best recommendation I can make is to get Reflector. With Reflector, you can look at the code yourself (most of the time) and see what the algorithms are. If Reflector does not show you, you could always download Microsoft's SSCLI (Shared Source Common Language Infrastructure) to see the C++ code behind MethodImplOptions.InternalCall methods.
For reference, here is the actual implementation of Encoding.ASCII.GetBytes:
public override int GetBytes(string chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
{
    if ((chars == null) || (bytes == null))
    {
        throw new ArgumentNullException();
    }
    if ((charIndex < 0) || (charCount < 0))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((chars.Length - charIndex) < charCount)
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((byteIndex < 0) || (byteIndex > bytes.Length))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((bytes.Length - byteIndex) < charCount)
    {
        throw new ArgumentException();
    }
    int num = charIndex + charCount;
    while (charIndex < num)
    {
        char ch = chars[charIndex++];
        if (ch >= '\x0080')
        {
            ch = '?';
        }
        bytes[byteIndex++] = (byte) ch;
    }
    return charCount;
}
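As the loop above shows, any character at or above U+0080 is silently replaced with '?'. For instance (my own illustration, not part of the decompiled source):

byte[] bytes = Encoding.ASCII.GetBytes("héllo");    // 'é' (U+00E9) is above 0x7F
Console.WriteLine(BitConverter.ToString(bytes));    // 68-3F-6C-6C-6F, i.e. "h?llo"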
The performance characteristic of implementing a general-purpose memcpy library function using a SIMD register is significantly more colorful than an equivalent implementation using a general-purpose register...
- Intel 64 and IA-32 Architectures Optimization Reference Manual (April 2018) §3.7.6.1
For screaming speed when converting medium- to larger-sized chunks of data between 8-bit byte[] and "wide" (16-bit, Unicode) text, you'll want to consider solutions that deploy the SIMD instructions PUNPCKLBW+PUNPCKHBW (widening) and PACKUSWB (narrowing). In .NET, these are available as x64 JIT intrinsics, emitted for the hardware-accelerated System.Numerics types Vector and Vector<T> (see here for more info). The generic version Vector<T> is defined in the System.Numerics.Vectors package, which currently remains under fairly active development. As illustrated below, you'll also probably want to include the System.Runtime.CompilerServices.Unsafe package, since this is the preferred SIMD load/store technique recommended by the Vector<T> authors.
The relevant SIMD acceleration is only enabled for capable CPUs in x64 mode, but otherwise .NET provides transparent fallback to emulation code in the System.Numerics.Vectors library, so the code demonstrated here does indeed reliably function across the wider .NET ecosystem, possibly with reduced performance. To test the code shown below, I used a console app on the full .NET Framework 4.8 ("desktop") in x64 (SIMD) and x86 (emulated) modes.
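Before benchmarking, it's worth confirming which path you're actually on; a quick check like this (illustrative, assumes using System.Numerics;) prints the effective configuration:

// Prints whether the accelerated path is active and the effective vector width
Console.WriteLine("Vector.IsHardwareAccelerated = " + Vector.IsHardwareAccelerated);
Console.WriteLine("Vector<byte>.Count = " + Vector<byte>.Count);   // 16 for 128-bit SSE2, 32 for AVX2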
Since I wouldn't want to deprive anyone of the opportunity to learn the relevant techniques, I'll use Vector.Widen to illustrate the byte[] to char[] direction in C# 7. From this example, doing the reverse--i.e., using Vector.Narrow to implement the narrowing direction--is straightforward and is left mostly as an exercise for the reader (a rough sketch appears at the end of this post).
Warning:
The methods suggested here are entirely encoding-unaware; they simply strip/expand (narrow/widen) raw bytes without any regard for character mapping, text encoding, or other linguistic properties. When widening, surplus bytes are set to zero, and when narrowing, excess bytes are discarded.
Others have discussed the numerous hazards associated with this practice on this page and elsewhere, so please carefully review and understand the nature of this operation before considering whether it is appropriate for your situation. For clarity, inline validation is elided from the code example shown below, but it could be added to the innermost loop with minimal impact on the SIMD benefit.
You have been warned. Canonical techniques using a suitable Encoding instance, although not SIMD-accelerated, are recommended for nearly all realistic app scenarios. The OP specifically asks for a maximum-performance solution, but first I'll summarize the proper, sanctioned techniques that should normally be used instead.
To widen a byte array to a .NET String, invoke the GetString() method on a suitable byte-oriented encoding instance:
String Encoding.ASCII.GetString(byte[] bytes)
To narrow a .NET String to an (e.g., ASCII) byte array, invoke the GetBytes() method on a suitable byte-oriented encoding instance:
byte[] Encoding.ASCII.GetBytes(char[] chars)
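For example, a typical round trip with these sanctioned APIs (illustrative snippet, assumes using System.Text;):

byte[] payload = Encoding.ASCII.GetBytes("PING\r\n");   // narrow: string -> byte[]
string text    = Encoding.ASCII.GetString(payload);     // widen:  byte[] -> string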
Ok, now on to the fun part--the extremely fast SIMD-enabled ("vectorized") C# code for "dumb" widening of a byte array. As a reminder, here are some dependencies that should be referenced:
// ...
using System.Numerics; // nuget: System.Numerics.Vectors
using System.Runtime.CompilerServices; // nuget: System.Runtime.CompilerServices.Unsafe
// ...
Here is the public entry point wrapper function. If you prefer a version that returns char[] instead of String, it's provided at the end of this post.
/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding.
/// </summary>
public static unsafe String ByteArrayToString(byte[] bytes)
{
// note: possible zeroing penalty; consider buffer pooling or
// other ways to allocate target?
var s = new String('\0', bytes.Length);
if (s.Length > 0)
fixed (char* dst = s)
fixed (byte* src = bytes)
widen_bytes_simd(dst, src, s.Length);
return s;
}
Next is the main working loop body. Notice the prologue loop that aligns the destination to a 16-byte memory boundary, if necessary, by bytewise copying of up to 15 source bytes. This ensures the most efficient operation of the main "quad-quadwise" loop which, with a single pairing of SIMD PUNPCKLBW/PUNPCKHBW instructions, writes 32 bytes at once (16 source bytes are fetched and then stored as 16 wide chars occupying 32 bytes). Pre-aligning, plus the choice of dst alignment (as opposed to src), are official recommendations from the Intel manual cited above. Likewise, aligned operation entails that when the main loop completes, the source may have up to 15 residual trailing bytes; these are finished up by a short epilogue loop.
static unsafe void widen_bytes_simd(char* dst, byte* src, int c)
{
    // prologue: copy bytewise until 'dst' is 16-byte aligned
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (char)*src++;

    // main loop: widen 16 bytes to 16 wide chars (32 bytes) per iteration
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Vector.Widen(Unsafe.AsRef<Vector<byte>>(src),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 0),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 8));

    // epilogue: copy up to 15 residual trailing bytes
    for (c += 0x10; c > 0; c--)
        *dst++ = (char)*src++;
}
That's actually all there is to it! It works like a charm and, as you'll see below, it does 'scream' as-advertised.
But first, by turning off the Visual Studio 2017 debugger option "Disable JIT optimizations," we can examine the native SIMD instruction stream that the x64 JIT generates for the 'release' build on .NET 4.7.2. Here is the relevant part of the main inner loop that blasts through the data 32 bytes at a time. Notice that the JIT has managed to emit the theoretically minimal fetch/store pattern.
L_4223 mov rax,rbx
L_4226 movups xmm0,xmmword ptr [rax] ; fetch 16 bytes
L_4229 mov rax,rdi
L_422C lea rdx,[rdi+10h]
L_4230 movaps xmm2,xmm0
L_4233 pxor xmm1,xmm1
L_4237 punpcklbw xmm2,xmm1 ; interleave 8-to-16 bits (lo)
L_423B movups xmmword ptr [rax],xmm2 ; store 8 bytes (lo) to 8 wide chars (16 bytes)
L_423E pxor xmm1,xmm1
L_4242 punpckhbw xmm0,xmm1 ; interleave 8-to-16 bits (hi)
L_4246 movups xmmword ptr [rdx],xmm0 ; store 8 bytes (hi) to 8 wide chars (16 bytes)
L_4249 add rbx,10h
L_424D add rdi,20h
L_4251 add esi,0FFFFFFF0h
L_4254 test esi,esi
L_4256 jge L_4223
L_4258 ...
Performance test results:
I tested the SIMD code against four other techniques that perform the same function. For the .NET encoders listed below, this was a call to the GetChars(byte[], int, int) method.
The testing included identical work for all units and validation that every unit under test produced identical results. Test bytes were random and ASCII-only ([0x01 - 0x7F]) in order to ensure identical results from all test units. Input size was random, maximum 1 MB, with a log2 bias towards smaller sizes such that the average size was about 80 KB.
For fairness, execution order was systematically rotated through the 5 units for each iteration. For warmup, timings were discarded and reset to zero once, at iteration 100. The test harness does not perform any allocations during the test phase and a full GC is forced and awaited each 10000 iterations.
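For concreteness, here is a rough, simplified sketch of that measurement loop (purely illustrative; MakeRandomAsciiBytes is a placeholder and, unlike the real harness, this sketch does allocate while timing):

var units = new (string Name, Func<byte[], string> Run)[]
{
    ("naive",    b => { var c = new char[b.Length]; for (int i = 0; i < b.Length; i++) c[i] = (char)b[i]; return new string(c); }),
    ("win-1252", b => Encoding.Default.GetString(b)),
    ("ascii",    b => Encoding.ASCII.GetString(b)),
    ("utf-8",    b => Encoding.UTF8.GetString(b)),
    ("simd",     b => ByteArrayToString(b)),
};
var ticks = new long[units.Length];
var sw = new Stopwatch();

for (int iter = 1; iter <= 100000; iter++)
{
    byte[] data = MakeRandomAsciiBytes();               // placeholder: random size, bytes in [0x01, 0x7F]
    for (int k = 0; k < units.Length; k++)
    {
        int u = (iter + k) % units.Length;              // rotate execution order each iteration
        sw.Restart();
        units[u].Run(data);
        sw.Stop();
        ticks[u] += sw.ElapsedTicks;
    }
    if (iter == 100)
        Array.Clear(ticks, 0, ticks.Length);            // discard warm-up timings once
    if (iter % 10000 == 0)
    {
        GC.Collect();                                   // force and await a full GC
        GC.WaitForPendingFinalizers();
        // ...report 'ticks' normalized to the best unit here...
    }
}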
Relative ticks, normalized to best result
.NET Framework 4.7.3056.0 x64 (release)

   iter  |   naive    win-1252     ascii      utf-8      simd
 ------- | --------- ----------- ---------- ---------- --------
  10000  |   131.5      294.5      186.2      145.6     100.0
  20000  |   137.7      305.3      191.9      149.4     100.0
  30000  |   139.2      308.5      195.8      151.5     100.0
  40000  |   141.8      312.1      198.5      153.2     100.0
  50000  |   142.0      313.8      199.1      154.1     100.0
  60000  |   140.5      310.6      196.7      153.0     100.0
  70000  |   141.1      312.9      197.3      153.6     100.0
  80000  |   141.6      313.7      197.8      154.1     100.0
  90000  |   141.3      313.7      197.9      154.3     100.0
 100000  |   141.1      313.3      196.9      153.7     100.0

gcServer=False; LatencyMode.Interactive; Vector.IsHardwareAccelerated=True
On the preferred x64 platform, when JIT optimization is enabled and SIMD is available, there was no contest: the SIMD code runs about 40% faster than the next-fastest contender. Encoding.Default, which is usually the "Windows-1252" code page, performed particularly poorly, about 3x slower than the SIMD code.
Earlier I mentioned that the distribution of test data sizes was strongly log-biased towards zero. Without this step--meaning a uniform distribution of sizes from 0 to 1,048,576 bytes (average test size 512K)--SIMD continues to outpace the pack with all other units faring relatively worse vs. the code shown above.
naive      153.45%
win-1252   358.84%
ascii      221.38%
utf-8      161.62%
simd       100.00%
As for the non-SIMD (emulation) case, UTF-8 and SIMD are extremely close--typically within 3-4% of each other--and far better than the rest. I found this result to be doubly surprising: that the UTF8Encoding source code was so fast (lots of fast-path optimization), and then also that the general-purpose SIMD emulation code was able to match that purpose-tuned code.
/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe char[] WidenByteArray(byte[] bytes)
{
var rgch = new char[bytes.Length];
if (rgch.Length > 0)
fixed (char* dst = rgch)
fixed (byte* src = bytes)
widen_bytes_simd(dst, src, rgch.Length);
return rgch;
}
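And for those who want the narrowing direction spelled out after all, here is a rough, untested sketch (mine, added for illustration) that mirrors the structure of widen_bytes_simd above; as warned earlier, the high byte of each char is simply discarded:

// Sketch of the narrowing direction (char* -> byte*), mirroring the widening loop.
// Like that loop, it hard-codes 128-bit vectors (Vector<byte>.Count == 16).
static unsafe void narrow_chars_simd(byte* dst, char* src, int c)
{
    // prologue: copy charwise until 'dst' is 16-byte aligned
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (byte)*src++;

    // main loop: narrow 16 wide chars (32 bytes) to 16 bytes per iteration
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Unsafe.AsRef<Vector<byte>>(dst) =
            Vector.Narrow(Unsafe.AsRef<Vector<ushort>>(src + 0),
                          Unsafe.AsRef<Vector<ushort>>(src + 8));

    // epilogue: copy up to 15 residual trailing chars
    for (c += 0x10; c > 0; c--)
        *dst++ = (byte)*src++;
}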