Assuming something like:
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
    unsigned int i;
    for(i=0; i<len; i++)
    {
        dest[i] = src[i] & mask[i];
    }
}
I can go faster on a non-aligned access machine (e.g. x86) by writing something like:
#include <stdint.h> /* uint32_t */

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
    unsigned int i;
    unsigned int wordlen = len >> 2;
    for(i=0; i<wordlen; i++)
    {
        ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; // this raises SIGBUS on SPARC and other archs that require aligned access.
    }
    /* handle the 0-3 bytes left over after the whole words */
    for(i=wordlen<<2; i<len; i++){
        dest[i] = src[i] & mask[i];
    }
}
However, it needs to build on several architectures, so I would like to do something like:
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
    unsigned int i;
    unsigned int wordlen = len >> 2;
#if defined(__ALIGNED2__) || defined(__ALIGNED4__) || defined(__ALIGNED8__)
    // go slow
    for(i=0; i<len; i++)
    {
        dest[i] = src[i] & mask[i];
    }
#else
    // go fast
    for(i=0; i<wordlen; i++)
    {
        // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
        ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
    }
    for(i=wordlen<<2; i<len; i++){
        dest[i] = src[i] & mask[i];
    }
#endif
}
But I cannot find any good information on compiler-defined macros (like my hypothetical __ALIGNED4__ above) that specify alignment, or any clever ways of using the preprocessor to determine the target architecture's alignment requirements. I could just test defined(__SVR4) && defined(__sun), but I would prefer something that will Just Work™ on other architectures requiring aligned memory accesses.
A memory access is said to be aligned when the data being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned.
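As a small illustration of that definition, a pointer can be checked for n-byte alignment by taking its address modulo n. The helper name is_aligned is hypothetical, and the sketch assumes that uintptr_t is available and that converting a pointer to it yields the numeric address, which holds on common flat-memory platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the address p is a multiple of n bytes,
   0 otherwise. n must be non-zero. */
static int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p % n) == 0;
}
```

With n equal to 1 the check always succeeds, which matches the remark that byte accesses are always aligned.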
By default, ARM7- and ARM9-based microcontrollers do not allow unaligned accesses to 16-bit and 32-bit data types. Cortex-M3 does support unaligned accesses, so the program above would behave correctly on it.
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
An unaligned address is then an address that isn't a multiple of the transfer size. The meaning in AXI4 would be the same.
While x86 silently fixes up unaligned accesses, this is hardly optimal for performance. It is usually best to assume a certain alignment and perform fixups yourself:
#include <stdint.h> /* uintptr_t */

unsigned int const alignment = 8; /* or 16, or sizeof(long) */
/* Skeleton only; called my_memcpy so it does not clash with the standard memcpy. */
void my_memcpy(char *dst, char const *src, unsigned int size) {
    /* uintptr_t is unsigned, so the remainder is never negative */
    if((((uintptr_t)dst) % alignment) != (((uintptr_t)src) % alignment)) {
        /* no common alignment, copy as bytes or shift around */
    } else {
        if(((uintptr_t)dst) % alignment) {
            /* copy bytes at the beginning */
        }
        /* copy words in the middle */
        if(((uintptr_t)dst + size) % alignment) {
            /* copy bytes at the end */
        }
    }
}
Also, take a look at SIMD instructions.
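The head/middle/tail skeleton above could be filled in for the masking function from the question roughly as follows. This is a minimal sketch, not a tuned implementation: the name mask_bytes_aligned is hypothetical, the word size is fixed at sizeof(uint32_t), and the buffers are assumed not to partially overlap. Like the code in the question, it casts byte pointers to uint32_t*, which relies on the compiler tolerating that aliasing.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: dest[i] = src[i] & mask[i], using aligned 32-bit accesses where
   possible. Hypothetical name; assumes no partial overlap between buffers. */
static void mask_bytes_aligned(unsigned char *dest, const unsigned char *src,
                               const unsigned char *mask, size_t len)
{
    const size_t w = sizeof(uint32_t);
    size_t i = 0;

    /* Word access is only safe if all three pointers share the same offset
       within a word; otherwise skip straight to the byte loop below. */
    if ((uintptr_t)dest % w == (uintptr_t)src % w &&
        (uintptr_t)dest % w == (uintptr_t)mask % w) {
        /* Head: process bytes until dest (and so src and mask) is aligned. */
        while (i < len && ((uintptr_t)(dest + i) % w) != 0) {
            dest[i] = src[i] & mask[i];
            i++;
        }
        /* Middle: whole words, every access aligned. */
        for (; i + w <= len; i += w) {
            *(uint32_t *)(dest + i) =
                *(const uint32_t *)(src + i) & *(const uint32_t *)(mask + i);
        }
    }
    /* Tail (or the whole buffer when the alignments differ): byte by byte. */
    for (; i < len; i++) {
        dest[i] = src[i] & mask[i];
    }
}
```

Because every word access is aligned, this version needs no per-architecture switch; the cost is a few extra comparisons per call and a byte-only path when the three pointers do not share an alignment.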
The standard approach would be to have a configure script that runs a program to test for alignment issues. If the test program doesn't crash, the configure script defines a macro in a generated config header that allows for the faster implementation. The safer implementation is the default.
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
    unsigned int i;
    unsigned int wordlen = len >> 2;
#if defined(UNALIGNED)
    // go fast
    for(i=0; i<wordlen; i++)
    {
        // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
        ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
    }
    for(i=wordlen<<2; i<len; i++){
        dest[i] = src[i] & mask[i];
    }
#else
    // go slow
    for(i=0; i<len; i++)
    {
        dest[i] = src[i] & mask[i];
    }
#endif
}
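Where running a test program is not practical (for example when cross-compiling), one alternative is a hand-maintained allowlist of predefined architecture macros that sets the same UNALIGNED macro. The fragment below is only an assumption about how that could look: __i386__ and __x86_64__ are predefined by GCC and Clang, _M_IX86 and _M_X64 by MSVC, and __ARM_FEATURE_UNALIGNED comes from the ARM C Language Extensions. Any target not listed falls back to the safe byte loop. The helper unaligned_access_enabled is hypothetical and exists only so the selected path can be inspected.

```c
/* Hypothetical config fragment: define UNALIGNED only for targets known to
   tolerate unaligned word access. Extend the list by hand as needed. */
#if !defined(UNALIGNED)
#  if defined(__i386__) || defined(__x86_64__) || \
      defined(_M_IX86) || defined(_M_X64) || \
      defined(__ARM_FEATURE_UNALIGNED)
#    define UNALIGNED 1
#  endif
#endif

/* Hypothetical helper: reports which code path the build selected
   (1 = fast word loop, 0 = safe byte loop). */
static int unaligned_access_enabled(void)
{
#if defined(UNALIGNED)
    return 1;
#else
    return 0;
#endif
}
```

An allowlist errs on the safe side: an unknown architecture gets the slow path instead of a SIGBUS.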
(I find it weird that you have src and mask when really these commute. I renamed mask_bytes to memand. But anyways...)
Another option is to use different functions that take advantage of types in C. For instance:
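#include <stddef.h> /* size_t */

/* len is the number of bytes */
void memand_bytes(char *dest, char *src1, char *src2, size_t len)
{
    size_t i; /* same type as len, so the comparison cannot truncate */
    for (i = 0; i < len; i++)
        dest[i] = src1[i] & src2[i];
}

/* len is the number of ints, not bytes; int pointers are already suitably aligned */
void memand_ints(int *dest, int *src1, int *src2, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        dest[i] = src1[i] & src2[i];
}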
This way you let the programmer decide.