Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do we achieve "substring-match" under O(n) time?

I have an assignment that requires reading a huge file of random inputs, for example:

Adana 
Izmir Adnan Menderes Apt
Addis Ababa
Aden
ADIYAMAN
ALDAN
Amman Marka Intl Airport
Adak Island
Adelaide Airport
ANURADHAPURA
Kodiak Apt
DALLAS/ADDISON
Ardabil 
ANDREWS AFB
etc..

If I specify a search term, the program is supposed to find the lines whereby a substring occurs. For example, if the search term is "uradha", the program is supposed to show ANURADHAPURA. If the search term is "airport", the program is supposed to show Amman Marka Intl Airport, Adelaide Airport

A quote from the assignment specs: "You are to program this application taking efficiency into account as though large amounts of data and processing is involved.."

I could easily achieve this functionality using a loop but the performance would be O(n). I was thinking of using a trie but it seems to only work if the substring starts from index 0.

I was wondering what solutions are there which gives a performance better than O(n)?

like image 254
Pacerier Avatar asked Nov 17 '11 09:11

Pacerier


4 Answers

Here's one with the O(n) as the worst case time complexity.

By the way, you should bookmark this link: http://www-igm.univ-mlv.fr/~lecroq/string/

like image 119
nullpotent Avatar answered Oct 19 '22 22:10

nullpotent


You could take a look at the Boyer-Moore string search algorithm or the Knuth-Morris-Pratt string search algorithm. They have good asymptotic performance, but I don't know of an algorithm that would not require to at least read once (almost all of) both the input and output string, and thus would have better than O(n) performance (where n is the size of the input).

like image 25
Sylvain Defresne Avatar answered Oct 19 '22 20:10

Sylvain Defresne


My gut says you are on the right track thinking of a trie and you may want to examine this section of the trie page on Wikipedia that links to Suffix Tree for some more ideas. O(n) ideas unfortunately.

like image 25
David Wheaton Avatar answered Oct 19 '22 20:10

David Wheaton


It the input text has almost static content (or values are added not so often, and values are added to the end of input source), but searching is often you can try following (probably the same as trie)

1) You'll read all text (and also update then new element is added) and prepare indexes table (map of symbol to coordinate (line or line with position) where match occurs)

'aa' - 1, 15, 27... 
'as' - 1, 15, 17...
'ba' - 2, 3, 15...
...

2) First search coordinate in index table by first 2 symbols

3) Then continue search in input text by coordinates

like image 29
Vitaliy Avatar answered Oct 19 '22 21:10

Vitaliy