Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently check string for one of several hundred possible suffixes

I need to write a C/C++ function that would quickly check if string ends with one of ~1000 predefined suffixes. Specifically the string is a hostname and I need to check if it belongs to one of several hundred predefined second-level domains.

This function will be called a lot so it needs to be written as efficiently as possible. Bitwise hacks etc anything goes as long as it turns out fast.

Set of suffixes is predetermined at compile-time and doesn't change.

I am thinking of either implementing a variation of Rabin-Karp or write a tool that would generate a function with nested ifs and switches that would be custom tailored to specific set of suffixes. Since the application in question is 64-bit to speed up comparisons I could store suffixes of up to 8 bytes in length as const sorted array and do binary search within it.

Are there any other reasonable options?

like image 272
Ghostrider Avatar asked Apr 04 '10 17:04

Ghostrider


2 Answers

If the suffixes don't contain any expansions/rules (like a regex), you could build a Trie of the suffixes in reverse order, and then match the string based on that. For instance

suffixes:
  foo
  bar
  bao

reverse order suffix trie:
  o
   -a-b  (matches bao)
   -o-f  (matches foo)
  r-a-b  (matches bar)

These can then be used to match your string:

"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix
like image 200
James Kolpack Avatar answered Sep 22 '22 23:09

James Kolpack


You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.

Then simply use a hashtable. Dimension it in such a way that there are no collisions, so you don't need buckets; lookups will be exactly O(1). For small hash types (e.g. 32 bits), you'd want to check if the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (order 10^-17) that you can probably live with it.

like image 24
Thomas Avatar answered Sep 23 '22 23:09

Thomas