
Is there a way to make this hash lookup any faster?

I have a requirement to (very) quickly process strings of a limited range, tallying their values. The input file is of the form:

January    7
March     22
September 87
March     36

and so forth. Because the line widths are identical, I can read each line quickly with fread, and I've developed a perfect hashing function which works. I wanted to see if anyone could offer any advice on how to make it even faster; I'll profile each suggestion to see how it goes.

The hashing function is based on the month name to allow fast allocation of the value to a bucket. Bear with me here. I first figured out the minimal number of characters for a perfect hash:

January
February
March
April
May
June
July
August
September
October
November
December

Keep in mind that the month fields are all nine characters wide, space-padded, because I'm working with the entire fixed-width input line.

Unfortunately, there is no single column that makes every month unique. Column 1 duplicates J, column 2 duplicates a, column 3 duplicates r, column 4 duplicates u, and columns 5 onwards duplicate <space> (there are other duplicates, but one is enough to rule out a single-column hash key).

However, by using the first and fourth column, I get the values Ju, Fr, Mc, Ai, M<space>, Je, Jy, Au, St, Oo, Ne and De, which are unique. There will be no invalid values in this file so I don't have to worry about incorrect buckets for the input data.
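That uniqueness is easy to machine-check. A throwaway sketch (my assumption: names are space-padded as in the file, so May's fourth column is a blank) that compares every pair of (column 1, column 4) keys:

```c
#include <assert.h>

/* Return 1 if no two months share both their first and fourth characters.
 * "May" gets a trailing space so column 4 exists, matching the
 * fixed-width input format.
 */
int keys_unique(void) {
    static const char *m[] = {
        "January", "February", "March", "April", "May ", "June",
        "July", "August", "September", "October", "November", "December"
    };
    int i, j;
    for (i = 0; i < 12; i++)
        for (j = i + 1; j < 12; j++)
            if (m[i][0] == m[j][0] && m[i][3] == m[j][3])
                return 0;   /* collision: not a perfect key */
    return 1;
}
```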

By viewing the hex codes for the characters, I found I could get low unique values by just ANDing with strategic values:

FirstChar  Hex  Binary     &0x0f
---------  ---  ---------  -----
   A       x41  0100 0001      1
   D       x44  0100 0100      4
   F       x46  0100 0110      6
   J       x4a  0100 1010     10
   M       x4d  0100 1101     13
   N       x4e  0100 1110     14
   O       x4f  0100 1111     15
   S       x53  0101 0011      3

SecondChar  Hex  Binary     &0x1f
----------  ---  ---------  -----
 <space>    x20  0010 0000      0
    c       x63  0110 0011      3
    e       x65  0110 0101      5
    i       x69  0110 1001      9
    o       x6f  0110 1111     15
    r       x72  0111 0010     18
    t       x74  0111 0100     20
    u       x75  0111 0101     21
    y       x79  0111 1001     25
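A quick sanity check of a few of those mask values (ASCII assumed):

```c
#include <assert.h>

/* Spot-check the &0x0f / &0x1f values from the tables above. */
int masks_ok(void) {
    return ('A' & 0x0f) ==  1 && ('J' & 0x0f) == 10 && ('S' & 0x0f) ==  3
        && (' ' & 0x1f) ==  0 && ('c' & 0x1f) ==  3 && ('u' & 0x1f) == 21
        && ('y' & 0x1f) == 25;
}
```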

and this allowed me to set up a static array to create a (hopefully) blindingly-fast hash function:

#define __ -1
static unsigned int hash (const char *str) {
    static unsigned char bucket[] = {
        //   A       S   D       F               J           M   N   O
        __, __, __, __, __, __, __, __, __, __, __, __, __,  4, __, __, // space
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __,  2, __, __, // c
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, 11, __, __, __, __, __,  5, __, __, __, 10, __, // e
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __,  3, __, __, __, __, __, __, __, __, __, __, __, __, __, __, // i
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,  9, // o
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __,  1, __, __, __, __, __, __, __, __, __, // r
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __,  8, __, __, __, __, __, __, __, __, __, __, __, __, // t
        __,  7, __, __, __, __, __, __, __, __,  0, __, __, __, __, __, // u
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __,  6, __, __, __, __, __  // y
    };
    return bucket[((unsigned int)(str[3]&0x1f)<<4)|(str[0]&0xf)];
}

Testing that with the code:

#include <stdio.h>
#include <string.h>

// Hash function here.

static const char *months[] = {
    "January  ", "February ", "March    ", "April    ", "May      ", "June     ",
    "July     ", "August   ", "September", "October  ", "November ", "December "
};

int main (void) {
    int i;
    for (i = 0; i < sizeof(months)/sizeof(*months); i++)
        printf ("%-10s -> %2d\n", months[i], hash(months[i]));
    return 0;
}

shows that it's functionally correct:

January    ->  0
February   ->  1
March      ->  2
April      ->  3
May        ->  4
June       ->  5
July       ->  6
August     ->  7
September  ->  8
October    ->  9
November   -> 10
December   -> 11

but I want to know if it can be made faster.

Any suggestions out there? I'm open to any simple optimisations or even a total rewrite if there's something inherently bad with my hashing function.


I don't think this is that important, but the final version will be using EBCDIC. The theory will still stand, though the AND operations may change slightly since the characters have different code points. I'll be happy with assistance on the ASCII front alone, since I'm confident whatever advice is offered will translate okay to EBCDIC.
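For context, here's a minimal sketch of the tallying loop I have in mind, with the same table rebuilt using C99 designated initializers so it fits on the page (unused slots default to 0 rather than -1, which is safe since the input is guaranteed valid). The 13-byte line width and the value field at columns 11-12 are assumptions about the file format:

```c
#include <stdio.h>
#include <stdlib.h>

/* The same perfect hash, rebuilt with C99 designated initializers.
 * Unused slots default to 0, which is safe because the input is
 * guaranteed to contain only valid month names.
 */
static unsigned int hash(const char *str) {
    static const unsigned char bucket[512] = {
        [('u' & 0x1f) << 4 | ('J' & 0xf)] =  0,  /* January   */
        [('r' & 0x1f) << 4 | ('F' & 0xf)] =  1,  /* February  */
        [('c' & 0x1f) << 4 | ('M' & 0xf)] =  2,  /* March     */
        [('i' & 0x1f) << 4 | ('A' & 0xf)] =  3,  /* April     */
        [(' ' & 0x1f) << 4 | ('M' & 0xf)] =  4,  /* May       */
        [('e' & 0x1f) << 4 | ('J' & 0xf)] =  5,  /* June      */
        [('y' & 0x1f) << 4 | ('J' & 0xf)] =  6,  /* July      */
        [('u' & 0x1f) << 4 | ('A' & 0xf)] =  7,  /* August    */
        [('t' & 0x1f) << 4 | ('S' & 0xf)] =  8,  /* September */
        [('o' & 0x1f) << 4 | ('O' & 0xf)] =  9,  /* October   */
        [('e' & 0x1f) << 4 | ('N' & 0xf)] = 10,  /* November  */
        [('e' & 0x1f) << 4 | ('D' & 0xf)] = 11   /* December  */
    };
    return bucket[(unsigned int)(str[3] & 0x1f) << 4 | (str[0] & 0xf)];
}

/* Tally each line's value into its month bucket.  Assumes 13-byte lines:
 * a 9-char month field, a separator column, a 2-char value, then '\n'.
 */
void tally(const char *data, size_t len, long totals[12]) {
    size_t i;
    for (i = 0; i + 13 <= len; i += 13)
        totals[hash(data + i)] += strtol(data + i + 10, NULL, 10);
}
```

Feeding the four sample lines above through tally should leave 7 in January's bucket, 58 in March's and 87 in September's.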

asked Aug 06 '10 by paxdiablo



2 Answers

I agree with the others that there is not much room for improvement. All I can suggest is a smaller lookup table that needs the same number of operations, which might let it stay longer in the CPU cache. In addition, it does not rely on the space-padding characters at the end, and it works with any mixture of uppercase and lowercase characters. I've found that adding some reasonable robustness against likely changes in the requirements often pays off in the future, especially once an implementation is optimised to the point where small changes are no longer easy.

#define __ -1
static unsigned int hash (const char *str)
{
    static unsigned char tab[] = {
        __, __,  1, 11, __, __, __, __,  7, __, __, __, __,  6,  0,  5,
         8, __,  2,  3,  9, __, 10, __, __,  4, __, __, __, __, __, __
    };
    return tab[ ( ( str[ 1 ] >> 4 ) & 1 ) + ( str[ 2 ] & 0x1f ) ];
}

This works similarly to your original idea, but with far fewer empty slots:

Month  s[1]          s[2]          s[1].4  s[2].4-0  sum  lookup
-----  ------------  ------------  ------  --------  ---  ------
Jan    61:0110 0001  6e:0110 1110       0        14   14       0
Feb    65:0110 0101  62:0110 0010       0         2    2       1
Mar    61:0110 0001  72:0111 0010       0        18   18       2
Apr    70:0111 0000  72:0111 0010       1        18   19       3
May    61:0110 0001  79:0111 1001       0        25   25       4
Jun    75:0111 0101  6e:0110 1110       1        14   15       5
Jul    75:0111 0101  6c:0110 1100       1        12   13       6
Aug    75:0111 0101  67:0110 0111       1         7    8       7
Sep    65:0110 0101  70:0111 0000       0        16   16       8
Oct    63:0110 0011  74:0111 0100       0        20   20       9
Nov    6f:0110 1111  76:0111 0110       0        22   22      10
Dec    65:0110 0101  63:0110 0111       0         3    3      11
             ^             ^ ^^^^
bits:        4             4 3210
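A quick harness (ASCII assumed) confirms both the mapping and the case-insensitivity claim; bare month names are enough since only str[1] and str[2] are examined:

```c
#include <assert.h>
#include <ctype.h>

#define __ -1
static unsigned int hash(const char *str) {
    static unsigned char tab[] = {
        __, __,  1, 11, __, __, __, __,  7, __, __, __, __,  6,  0,  5,
         8, __,  2,  3,  9, __, 10, __, __,  4, __, __, __, __, __, __
    };
    return tab[((str[1] >> 4) & 1) + (str[2] & 0x1f)];
}

/* Return 1 if every month hashes to its index, in both lower and upper case. */
int verify(void) {
    static const char *months[] = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };
    int i, j;
    for (i = 0; i < 12; i++) {
        char upper[16];
        for (j = 0; months[i][j]; j++)
            upper[j] = (char)toupper((unsigned char)months[i][j]);
        upper[j] = '\0';
        if (hash(months[i]) != (unsigned int)i) return 0;
        if (hash(upper) != (unsigned int)i) return 0;  /* case-insensitive */
    }
    return 1;
}
```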
answered Oct 12 '22 by x4u


Here's the smallest table I could find for EBCDIC-US: it has 24 elements in the bucket and uses only two operations (an add and a mask) to compute the index:

#define __ -1
static unsigned int hash (const char *str)
{
    static unsigned char tab[] = {
        11,  4, __,  7, __, __,  9,  1,
        __, __, __, __, __, __, __, __,
         3,  5,  2, 10,  8, __,  0,  6
    };
    return tab[0x17 & (str[1] + str[2])];
}

Second best, 25 items with xor:

#define __ -1
static unsigned int hash (const char *str)
{
    static unsigned char tab[] = {
         9, __, __,  7, __, __, 11,  1,
        __,  4, __, __, __, __,  3, __,
        __,  5,  8, 10,  0, __, __,  6,  2
    };
    return tab[0x1f & (str[1] ^ str[2])];
}

(Strictly speaking, tab[] should be 32 entries long here, because 0x1f can generate indices up to 31 for invalid inputs, which would read past the end of the 25-entry table.)


Update from Pax: For what it's worth, the first option worked perfectly for EBCDIC code page 500:

## Month     str[1] str[2] Lookup
-- --------- ------ ------ ------
 0 January   a (81) n (95)      0
 1 February  e (85) b (82)      1
 2 March     a (81) r (99)      2
 3 April     p (97) r (99)      3
 4 May       a (81) y (a8)      4
 5 June      u (a4) n (95)      5
 6 July      u (a4) l (93)      6
 7 August    u (a4) g (87)      7
 8 September e (85) p (97)      8
 9 October   c (83) t (a3)      9
10 November  o (96) v (a5)     10
11 December  e (85) c (83)     11
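The first option can also be cross-checked on an ASCII machine by hard-coding the CP500 code points from the table above, since the hash only ever sees str[1] and str[2] (sketch only; 0xff stands in for the answer's -1 sentinel):

```c
#include <assert.h>

#define __ 0xff   /* sentinel for unused slots */

/* Verify the additive EBCDIC hash using the CP500 code points for the
 * second and third characters of each month, as listed in the table above.
 */
int verify_cp500(void) {
    static const unsigned char tab[] = {
        11,  4, __,  7, __, __,  9,  1,
        __, __, __, __, __, __, __, __,
         3,  5,  2, 10,  8, __,  0,  6
    };
    static const unsigned char cp500[12][2] = {
        {0x81, 0x95}, {0x85, 0x82}, {0x81, 0x99}, {0x97, 0x99},  /* Jan-Apr */
        {0x81, 0xa8}, {0xa4, 0x95}, {0xa4, 0x93}, {0xa4, 0x87},  /* May-Aug */
        {0x85, 0x97}, {0x83, 0xa3}, {0x96, 0xa5}, {0x85, 0x83}   /* Sep-Dec */
    };
    int i;
    for (i = 0; i < 12; i++)
        if (tab[0x17 & (cp500[i][0] + cp500[i][1])] != (unsigned char)i)
            return 0;
    return 1;
}
```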
answered Oct 12 '22 by Nordic Mainframe