Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quickly find whether a value is present in a C array?

People also ask

How do you check if a value exists in an array in C?

To check if given Array contains a specified element in C programming, iterate over the elements of array, and during each iteration check if this element is equal to the element we are searching for.

How do you check if a value is present in an array?

JavaScript Array includes() The includes() method returns true if an array contains a specified value. The includes() method returns false if the value is not found. The includes() method is case sensitive.

How do I find an element in an array of arrays?

To find the position of an element in an array, you use the indexOf() method. This method returns the index of the first occurrence the element that you want to find, or -1 if the element is not found. The following illustrates the syntax of the indexOf() method.


In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: Loop unrolling, cache prefetch and making use of the multiple load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.


There's a trick for optimizing it (I was asked this on a job-interview once):

  • If the last entry in the array holds the value that you're looking for, then return true
  • Write the value that you're looking for into the last entry in the array
  • Iterate the array until you encounter the value that you're looking for
  • If you've encountered it before the last entry in the array, then return true
  • Return false

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)
        return true;
    theArray[SIZE-1] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    theArray[SIZE-1] = x;
    return i != SIZE-1;
}

This yields one branch per iteration instead of two branches per iteration.


UPDATE:

If you're allowed to allocate the array to SIZE+1, then you can get rid of the "last entry swapping" part:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}

You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray+SIZE;
}

If the compiler doesn't already apply it, then this function will do so for sure. On the other hand, it might make it harder on the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...


Keep the table in sorted order, and use Bentley's unrolled binary search:

i = 0;
if (key >= a[i+512]) i += 512;
if (key >= a[i+256]) i += 256;
if (key >= a[i+128]) i += 128;
if (key >= a[i+ 64]) i +=  64;
if (key >= a[i+ 32]) i +=  32;
if (key >= a[i+ 16]) i +=  16;
if (key >= a[i+  8]) i +=   8;
if (key >= a[i+  4]) i +=   4;
if (key >= a[i+  2]) i +=   2;
if (key >= a[i+  1]) i +=   1;
return (key == a[i]);

The point is,

  • if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
  • Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
  • Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.

** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it. For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch. So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches. That's why it's fast!

On the other hand, what if your first test is if (key == a[i+512)? The probability of being true is 1/1024, while the probability of false is 1023/1024. So if it's true you learn all 10 bits! But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing! So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit. That test is not carrying its weight!


You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:

  • perfect hash function
  • binary search

Perfect hash function

If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:

  • Keep the hash function reasonably fast.
  • Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.

Note for efficient hash functions, n is often a power of 2, which is equivalent to a bitwise mask of low bits (AND operation). Example hash functions:

  • CRC of input bytes, modulo n.
  • ((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)

Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.

Then in your interrupt routine, with input x:

  1. Hash x to index i (which is in the range 0..n)
  2. Look up entry i in the table and see if it contains the value x.

This will be much faster than a linear search of 256 or 1024 values.

I've written some Python code to find reasonable hash functions.

Binary search

If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.


If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).

Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.

Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.

One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, than k>1 probes might be practical.


Use a hash set. It will give O(1) lookup time.

The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data. The solution can be expanded for a situation where this is not the case.

#define HASH(x) (((x >> 16) ^ x) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];

int lookup(uint32_t value)
{
    int i = HASH(value);
    while (my_hash[i] != 0 && my_hash[i] != value) i = (i + 1) % HASH_LEN;
    return i;
}

void store(uint32_t value)
{
    int i = lookup(value);
    if (my_hash[i] == 0)
       my_hash[i] = value;
}

bool contains(uint32_t value)
{
    return (my_hash[lookup(value)] == value);
}

In this example implementation, the lookup time will typically be very low, but at the worst case can be up to the number of entries stored. For a realtime application, you can consider also an implementation using binary trees, which will have a more predictable lookup time.


In this case, it might be worthwhile investigating Bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.

Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.