I'm implementing an algorithm (SpookyHash) that treats arbitrary data as 64-bit integers, by casting the pointer to (ulong*). (This is inherent to how SpookyHash works; rewriting it to not do so is not a viable solution.)
This means that it could end up reading 64-bit values that are not aligned on 8-byte boundaries.
On some CPUs, this works fine. On some, it would be very slow. On yet others, it would cause errors (either exceptions or incorrect results).
I therefore have code to detect unaligned reads, and copy chunks of data to 8-byte aligned buffers when necessary, before working on them.
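To make the workaround concrete, here is a minimal sketch of the alignment check and the safe fallback. It is not my actual SpookyHash code: that copies whole chunks into an aligned buffer, whereas this reads a single word, and the names AlignedReader, IsAligned8 and ReadUInt64 are made up for illustration.

```csharp
using System;

static class AlignedReader
{
    // True when the address sits on an 8-byte boundary.
    public static unsafe bool IsAligned8(void* p)
    {
        return (((long)p) & 7) == 0;
    }

    // Reads a native-endian 64-bit value. When unaligned reads are not allowed
    // and the pointer is misaligned, the bytes are copied one at a time into an
    // aligned local instead of being dereferenced as a ulong.
    public static unsafe ulong ReadUInt64(byte* p, bool allowUnalignedRead)
    {
        if (allowUnalignedRead || IsAligned8(p))
            return *(ulong*)p;

        ulong value = 0;
        byte* dst = (byte*)&value; // a local ulong is suitably aligned
        for (int i = 0; i < sizeof(ulong); i++)
            dst[i] = p[i];
        return value;
    }

    // Safe wrapper so callers do not need an unsafe context.
    public static unsafe ulong ReadUInt64(byte[] data, int offset, bool allowUnalignedRead)
    {
        if (data == null)
            throw new ArgumentNullException(nameof(data));
        if (offset < 0 || offset > data.Length - sizeof(ulong))
            throw new ArgumentOutOfRangeException(nameof(offset));
        fixed (byte* p = data)
            return ReadUInt64(p + offset, allowUnalignedRead);
    }
}
```

Both branches give the same result on a chip that tolerates unaligned reads; the only difference is cost, which is the whole point of the question.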
However, my own machine has an Intel x86-64. This tolerates unaligned reads well enough that it gives much faster performance if I just ignore the issue of alignment, as does x86. It also allows for memcpy-like and memzero-like methods to deal in 64-byte chunks for another boost. These two performance improvements are considerable, more than enough of a boost to make such an optimisation far from premature.
So. I've an optimisation that is well worth making on some chips (and for that matter, probably the two chips most likely to have this code run on them), but would be fatal or give worse performance on others. Clearly the ideal is to detect which case I am dealing with.
Some further requirements:
This is intended to be a cross-platform library for all systems that support .NET or Mono. Therefore anything specific to a given OS (e.g. P/Invoking to an OS call) is not appropriate, unless it can safely degrade in the face of the call not being available.
False negatives (identifying a chip as unsafe for the optimisation when it is in fact safe) are tolerable, false positives are not.
Expensive operations are fine, as long as they can be done once, and then the result cached.
The library already uses unsafe code, so there's no need to avoid that.
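The "done once, then cached" requirement could be met along these lines. This is only a sketch: AlignmentPolicy and DetectionRuns are hypothetical names, and Detect stands in for whatever detection is eventually used. It defaults to false whenever the answer is unknown, so it can give false negatives but not false positives.

```csharp
using System;
using System.Threading;

static class AlignmentPolicy
{
    private static int _detectionRuns;

    // How many times detection actually ran; exposed only to show the caching.
    public static int DetectionRuns => _detectionRuns;

    // Lazy<bool> runs Detect at most once, even if several threads ask at once.
    private static readonly Lazy<bool> _allowUnalignedRead =
        new Lazy<bool>(Detect, LazyThreadSafetyMode.ExecutionAndPublication);

    public static bool AllowUnalignedRead => _allowUnalignedRead.Value;

    // Placeholder detection: can be as expensive as needed, since it runs once.
    private static bool Detect()
    {
        Interlocked.Increment(ref _detectionRuns);
        try
        {
            switch (Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE"))
            {
                case "x86":
                case "AMD64":
                    return true; // Known to tolerate unaligned reads well.
            }
        }
        catch
        {
            // Any failure just means "unknown", which is treated as unsafe.
        }
        return false;
    }
}
```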
So far I have two approaches:
The first is to initialise my flag with:
private static bool AttemptDetectAllowUnalignedRead()
{
switch(Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE"))
{
case "x86": case "AMD64": // Known to tolerate unaligned-reads well.
return true;
}
return false; // Not known to tolerate unaligned-reads well.
}
The other is that since the buffer copying necessary for avoiding unaligned reads is created using stackalloc, and since on x86 (including AMD64 in 32-bit mode), stackallocing a 64-bit type may sometimes return a pointer that is 4-byte aligned but not 8-byte aligned, I can then tell at that point that the alignment workaround isn't needed, and never attempt it again:
if(!AllowUnalignedRead && length != 0 && (((long)message) & 7) != 0) // Need to avoid unaligned reads.
{
ulong* buf = stackalloc ulong[2 * NumVars]; // buffer to copy into.
if((7 & (long)buf) != 0) // Not 8-byte aligned, so clearly this was unnecessary.
{
AllowUnalignedRead = true;
Thread.MemoryBarrier(); //volatile write
This latter though will only work on 32-bit execution (even if unaligned 64-bit reads are tolerated, no good implementation of stackalloc would force them on a 64-bit processor). It also could potentially give a false positive in that the processor might insist on 4-byte alignment, which would have the same issue.
Any ideas for improvements, or better yet, an approach that, unlike the two approaches above, gives no false negatives?
Specify an alignment that's a power of two, such as 1, 2, 4, 8, 16, and so on. Don't use a value smaller than the size of the type. struct and union types have an alignment equal to the largest alignment of any member. Padding bytes are added within a struct to ensure individual member alignment requirements are met.
By default, the compiler aligns data based on its size: char on a 1-byte boundary, short on a 2-byte boundary, int, long, and float on a 4-byte boundary, double on 8-byte boundary, and so on. Additionally, by aligning frequently used data with the processor's cache line size, you can improve cache performance.
The align modifier overrides this default and aligns the target memory at the requested alignment. To get the benefits of fast data transfer and the necessary alignment on the target, ensure that the processor data is aligned on the same boundary as the alignment specified in the align modifier.
For best performance, we align lda and ldc to the 32-byte memory alignment requirement of AVX2. This alignment ensures that the load performance inside the GEMM microkernel is not lowered by unaligned loads which might suffer from cacheline splits (vector loads and stores of data spanning across cacheline boundaries).
Well, here is my own final-for-now answer. While I'm answering my own question here, I owe a lot to the comments.
Ben Voigt and J Trana's comments made me realise something. While my specific question is a boolean one, the general question is not:
Pretty much all modern processors have a performance hit for unaligned reads, it's just that with some that hit is so slight as to be insignificant compared to the cost of avoiding it.
So the real question isn't "which processors allow unaligned reads cheaply enough?" but rather "which processors allow unaligned reads cheaply enough for my current situation?" A fully consistent and reliable method therefore isn't just impossible; as a question unrelated to a particular case, it's meaningless.
And as such, white-listing cases known to be good enough for the code at hand is the only way to go.
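As an aside, on runtimes that provide System.Runtime.InteropServices.RuntimeInformation (it isn't available on the older frameworks this library also targets, so this is an assumption about the environment), the same white-list idea can be expressed without environment variables or external processes. A sketch, with ArchitectureWhiteList being a name I've made up:

```csharp
using System.Runtime.InteropServices;

static class ArchitectureWhiteList
{
    // Only architectures known to be good enough for this code are listed;
    // everything else, including anything unknown, is treated as unsafe.
    public static bool IsWhiteListed(Architecture architecture)
    {
        switch (architecture)
        {
            case Architecture.X86:
            case Architecture.X64:
                return true;
            default:
                return false;
        }
    }

    public static bool CurrentProcessIsWhiteListed()
    {
        try
        {
            return IsWhiteListed(RuntimeInformation.ProcessArchitecture);
        }
        catch
        {
            // No answer available: a false negative is acceptable here.
            return false;
        }
    }
}
```

The white-list stays specific to this code's needs: another library with a different cost trade-off might list other chips.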
It's Stu, though, whom I owe for getting my success with Mono on *nix up to the level I was having with .NET and Mono on Windows. The discussion in the comments above brought my train of thought to a relatively simple, but reasonably effective, approach (and if Stu posts an answer with "I think you should base your approach on having platform-specific code run safely", I'll accept it, because that was the crux of one of his suggestions, and the key to what I've done).
As before I first try checking an environment variable that will generally be set in Windows, and not set on any other OS.
If that fails, I try to run uname -p and parse the results. That can fail for a variety of reasons (not running on *nix, not having sufficient permissions, running on one of the forms of *nix that has a uname command but no -p flag). With any exception, I just eat the exception, and then try uname -m, which is more widely available, but has a greater variety of labels for the same chips.
And if that fails, I just eat any exception again, and consider it a case of my white-list not having been satisfied: I can get false negatives which will mean sub-optimal performance, but not false positives resulting in error. I can also add to the white-list easily enough if I learn a given family of chips is similarly better off with the code-branch that doesn't try to avoid unaligned reads.
The current code looks like:
[SuppressMessage("Microsoft.Design", "CA1031:DoNotCatchGeneralExceptionTypes",
Justification = "Many exceptions possible, all of them survivable.")]
[ExcludeFromCodeCoverage]
private static bool AttemptDetectAllowUnalignedRead()
{
switch(Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE"))
{
case "x86":
case "AMD64": // Known to tolerate unaligned-reads well.
return true;
}
// Analysis disable EmptyGeneralCatchClause
try
{
return FindAlignSafetyFromUname();
}
catch
{
return false;
}
}
[SecuritySafeCritical]
[SuppressMessage("Microsoft.Design", "CA1031:DoNotCatchGeneralExceptionTypes",
Justification = "Many exceptions possible, all of them survivable.")]
[ExcludeFromCodeCoverage]
private static bool FindAlignSafetyFromUname()
{
var startInfo = new ProcessStartInfo("uname", "-p");
startInfo.CreateNoWindow = true;
startInfo.ErrorDialog = false;
startInfo.LoadUserProfile = false;
startInfo.RedirectStandardOutput = true;
startInfo.UseShellExecute = false;
try
{
var proc = new Process();
proc.StartInfo = startInfo;
proc.Start();
using(var output = proc.StandardOutput)
{
string line = output.ReadLine();
if(line != null)
{
string trimmed = line.Trim();
if(trimmed.Length != 0)
switch(trimmed)
{
case "amd64":
case "i386":
case "x86_64":
case "x64":
return true; // Known to tolerate unaligned-reads well.
}
}
}
}
catch
{
// We don't care why we failed, as there are many possible reasons, and they all amount
// to our not having an answer. Just eat the exception.
}
startInfo.Arguments = "-m";
try
{
var proc = new Process();
proc.StartInfo = startInfo;
proc.Start();
using(var output = proc.StandardOutput)
{
string line = output.ReadLine();
if(line != null)
{
string trimmed = line.Trim();
if(trimmed.Length != 0)
switch(trimmed)
{
case "amd64":
case "i386":
case "i686":
case "i686-64":
case "i86pc":
case "x86_64":
case "x64":
return true; // Known to tolerate unaligned-reads well.
default:
if(trimmed.Contains("i686") || trimmed.Contains("i386"))
return true;
return false;
}
}
}
}
catch
{
// Again, just eat the exception.
}
// Analysis restore EmptyGeneralCatchClause
return false;
}
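The two near-identical uname blocks above could be folded into one helper. The following is a sketch of that tidy-up rather than what I currently ship. UnameProbe, IsKnownGoodLabel and TryUname are hypothetical names, and it merges the two label lists into one, so uname -p output is matched slightly more generously than in the code above. It also disposes the Process, which the code above doesn't.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class UnameProbe
{
    // Labels known to name chips that tolerate unaligned reads well.
    private static readonly HashSet<string> KnownGood =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "amd64", "i386", "i686", "i686-64", "i86pc", "x86_64", "x64"
        };

    public static bool IsKnownGoodLabel(string label)
    {
        if (string.IsNullOrWhiteSpace(label))
            return false;
        string trimmed = label.Trim();
        return KnownGood.Contains(trimmed)
            || trimmed.Contains("i686")
            || trimmed.Contains("i386");
    }

    // Returns the first line printed by "uname <argument>", or null on any failure.
    private static string TryUname(string argument)
    {
        try
        {
            var startInfo = new ProcessStartInfo("uname", argument)
            {
                CreateNoWindow = true,
                ErrorDialog = false,
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var proc = Process.Start(startInfo))
            {
                if (proc == null)
                    return null;
                string line = proc.StandardOutput.ReadLine();
                proc.WaitForExit();
                return line;
            }
        }
        catch
        {
            // Many possible reasons, all amounting to having no answer.
            return null;
        }
    }

    public static bool FindAlignSafetyFromUname()
    {
        // -p first, then the more widely available -m.
        return IsKnownGoodLabel(TryUname("-p")) || IsKnownGoodLabel(TryUname("-m"));
    }
}
```

Adding a newly verified chip family then means adding one string to KnownGood.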