Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what is the better way to search in millions of file names with wildcard(GLOB) support

i am working on a small search engine to display a matching file names with full path. and important thing is that i need to provide wildcard(GLOB) search like *.doc or *list*.xlx or *timesheet* or ???.doc or something like that.

i found some related solution

Search for strings matching the pattern "abc:*:xyz" in less than O(n)

but i am looking for efficient algorithms which can find matches out of million file names in a less than a second, so better than O(n) is required..

i am thinking of two phase algorithm with substring array (Suffix array + prefix array) search in first phase and normal RegEx search thru the results of first phase second phase.

any help would be greatly appreciated...

like image 668
Mahes Avatar asked Feb 07 '10 02:02

Mahes


2 Answers

Check out self indexing: This Stack Overflow question, and this DrDobbs article on it.

like image 113
Nick Johnson Avatar answered Sep 29 '22 06:09

Nick Johnson


As far as I know there is no way to do better than O(n) for generalized glob searching.

However for the special cases of prefix and suffix search you can make yourself sorted indexes to do a binary search on, resulting in O(log n) for prefix and suffix search. The prefix index would be sorted based on the first character, then the second, and so on. The suffix index would be sorted based on the last character, then the second last, and so on (the reverse strings).

I would do as you suggest in your post and do the search in two phases, search the prefix or suffix indexes, followed by a brute force search through the reduced list provided from the first phase using a regex made from the glob.

Since string length comparisons are faster than regexes, I would also pre-filter for the minimum matching string length, or fixed length matching string for the ???.doc example.

From the sound of your original post the indexes would need to refer to the full path of each entry as well, so that you can display it after the final results are found.

like image 23
Sarah Happy Avatar answered Sep 29 '22 07:09

Sarah Happy