Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to work with unaligned data on a word-aligned processor?

I'm doing a project on an ARM Cortex M0, which does not support unaligned(by 4bytes) access, and I'm trying to optimize the speed of operations on unaligned data.

I'm storing Bluetooth Low Energy Access addresses (48bit) as 6-byte arrays in some packed structs acting as packet buffers. Because of the packing, the BLE addresses are not necessarily starting at a word aligned address, and I'm running into some complications when optimizing my access functions to these addresses.

The first, and most obvious approach is a for loop operating on each byte in the array individually. Checking if two addresses are the same could for instance be done like this:

uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
{
  for (uint32_t i = 0; i < 6; ++i)
  {
    if (addr1[i] != addr2[i])
      return 0;
  }
  return 1;
}

I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could cast them to uint64_t, and compare with 48 bit masks applied, i.e.

((uint64_t)&addr1[0] & 0xFFFFFFFFFFFF) == ((uint64_t)&addr2[0] & 0xFFFFFFFFFFFF)

Similar operations could be done for writing, and it works well for aligned versions. However, since my addresses aren't always word-aligned (or even half-word), I would have to do some extra tricks to make this work.

First off, I came up with this unoptimized nightmare of a compiler macro:

#define ADDR_ALIGNED(_addr) (uint64_t)(((*((uint64_t*)(((uint32_t)_addr) & ~0x03)) >> (8*(((uint32_t)_addr) & 0x03))) & 0x000000FFFFFFFF)\
                                    | (((*((uint64_t*)(((uint32_t)_addr+4) & ~0x03))) << (32-8*(((uint32_t)_addr) & 0x03)))) & 0x00FFFF00000000)

It basically shifts the entire address to start at the previous word aligned memory position, regardless of offset. For instance:

    0       1       2       3
|-------|-------|-------|-------|
|.......|.......|.......|<ADDR0>|
|<ADDR1>|<ADDR2>|<ADDR3>|<ADDR4>|
|<ADDR5>|.......|.......|.......|

becomes

    0       1       2       3
|-------|-------|-------|-------|
|<ADDR0>|<ADDR1>|<ADDR2>|<ADDR3>|
|<ADDR4>|<ADDR5>|.......|.......|
|.......|.......|.......|.......|

and I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:

ADDR_ALIGNED(addr1) == ADDR_ALIGNED(addr2)

Neat! But this operation takes 71 lines of assembly when compiled with the ARM-MDK, compared to 53 when doing the comparison in a simple for loop (I'm just going to disregard the additional time spent in the branch instructions here), and ~30 when unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Unaligning it again would require a similar operation, and the whole approach generally seems to suck.

Is an unrolled for-loop working each byte individually really the fastest solution for cases like this? Does anyone have any experience with similar scenarios, and feel like sharing some of their wizardry here?

like image 564
dromtrund Avatar asked Apr 24 '15 20:04

dromtrund


People also ask

Why are aligned memory access faster?

Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc. Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results.

What is the difference between word aligned access and unaligned access of data in memory?

A memory access is said to be aligned when the data being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned.

What is unaligned data access?

Unaligned memory accesses occur when you try to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.

What is word alignment in computer?

Alignment determines the appearance and orientation of the edges of the paragraph: left-aligned text, right-aligned text, centered text, or justified text, which is aligned evenly along the left and right margins.


2 Answers

UPDATE

Ok, because your data has no alignment whatsover, you need to either read all the data in, byte by byte, into properly aligned buffers and then do really fast 64-bit compares, or, if you won't be using the data after the compares, just read in the data as bytes and do 6 compares, in which case calling memcmp() might be a better option.


For at least 16-bit aligned:


 u16 *src1 = (u16 *)addr1; 
 u16 *src2 = (u16 *)addr2;

 for (int i = 0; i &lt 3; ++i)
 {
    if (src1[i] != src2[i])
      return 0;
 }

 return 1;

Will be twice as fast as byte comparisons and might be the best you can reasonably do as long as your data is at least 2-byte aligned. I'd also expect the compiler to remove the for loop completely and just use conditionally executed if statements instead.

Trying for 32-bit aligned reads will not be faster unless you can guarentee that source1 and 2 are similiarly aligned (add1 & 0x03) == (addr2 & 0x03). If this is the case, then you can read in a 32-bit value and then a 16-bit (or visa-versa, depending on starting alignment) and remove 1 more compare.

As 16-bit is your shared base, you can start there and the compiler should generate nice ldrh type opcodes.

like image 189
Michael Dorgan Avatar answered Sep 20 '22 22:09

Michael Dorgan


You might get your compiler to choose the fastest way for you:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint64_t unalignedload(char const *packed)
{
  uint64_t buffer;
  memcpy(&buffer, packed, 8);
  return buffer;
}

This is not quite what you want, as loading 8 bytes might segfault if you're unaligned and run off the page, but it's a start. If you can add two bytes of padding to the end of the array, you can easily avoid that problem.
gcc and clang seem to optimize this well.

like image 37
EOF Avatar answered Sep 23 '22 22:09

EOF