I am working on a small search engine that displays matching file names with their full paths. The important thing is that I need to provide wildcard (glob) search, e.g. *.doc
or *list*.xlx
or *timesheet*
or ???.doc
or something like that.
I found a related solution:
Search for strings matching the pattern "abc:*:xyz" in less than O(n)
but I am looking for an efficient algorithm that can find matches among a million file names in less than a second, so better than O(n) is required.
I am thinking of a two-phase algorithm: a substring-array (suffix array + prefix array) search in the first phase, and a normal regex search through the results of the first phase in the second phase.
Any help would be greatly appreciated.
Check out self-indexing: this Stack Overflow question, and this Dr. Dobb's article on it.
As far as I know there is no way to do better than O(n) for generalized glob searching.
However for the special cases of prefix and suffix search you can make yourself sorted indexes to do a binary search on, resulting in O(log n) for prefix and suffix search. The prefix index would be sorted based on the first character, then the second, and so on. The suffix index would be sorted based on the last character, then the second last, and so on (the reverse strings).
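Something along these lines is what I mean by the sorted indexes (a rough Python sketch; names like build_indexes and prefix_search are just illustrative, not from any particular library):

```python
import bisect
import os

def build_indexes(paths):
    """Build a prefix index and a suffix index over the file names.

    Each entry is (key, full_path) so the full path is available to
    display once a match is found.
    """
    prefix_index = sorted((os.path.basename(p), p) for p in paths)
    # Suffix index: the same names reversed, so a suffix search becomes
    # a prefix search over the reversed strings.
    suffix_index = sorted((os.path.basename(p)[::-1], p) for p in paths)
    return prefix_index, suffix_index

def prefix_search(index, prefix):
    """Binary search (O(log n)) for all entries whose key starts with prefix."""
    lo = bisect.bisect_left(index, (prefix,))
    hi = bisect.bisect_right(index, (prefix + "\uffff",))
    return [path for _, path in index[lo:hi]]

def suffix_search(suffix_index, suffix):
    """A suffix search is just a prefix search on the reversed keys."""
    return prefix_search(suffix_index, suffix[::-1])
```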
I would do as you suggest in your post and run the search in two phases: first search the prefix or suffix indexes, then do a brute-force search through the reduced list from the first phase using a regex built from the glob.
Since string-length comparisons are faster than regexes, I would also pre-filter on the minimum matching string length, or on the exact length for fixed-length patterns like the ???.doc example (see the sketch below).
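For the second phase, a sketch of what I have in mind (again Python, assuming candidates is the reduced list from the index search; fnmatch.translate turns the glob into a regex):

```python
import fnmatch
import os
import re

def glob_match(candidates, pattern):
    """Filter candidate paths with a cheap length check, then with a
    regex compiled from the glob pattern."""
    # Minimum possible match length: every character except '*' must be
    # matched by exactly one character ('?' included).
    min_len = len(pattern.replace("*", ""))
    # For patterns with no '*' (e.g. ???.doc) the match length is fixed.
    fixed_len = min_len if "*" not in pattern else None

    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    results = []
    for path in candidates:
        name = os.path.basename(path)
        if fixed_len is not None and len(name) != fixed_len:
            continue
        if len(name) < min_len:
            continue
        if regex.match(name):
            results.append(path)
    return results
```

So for *.doc you would first do a suffix search for ".doc" against the suffix index, then run glob_match over those candidates to confirm the full pattern.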
From the sound of your original post, the indexes would need to refer to the full path of each entry as well, so that you can display it once the final results are found.