This was an interview question, and the concern is efficiency. Given a very large file (gigabytes), such as a log file, how can we find the 10th occurrence of a word like 'error' or 'java' counting from the end of the file? The only approach I can think of is scanning through the entire file and then working out the occurrences in reverse order, but I don't think that is the right way to do it. (Coding preferably in C or Java.)
I would also like to know one more thing: when an interviewer specifically mentions that it is a very large file, what factors should be considered before we start writing the code (apart from keeping in mind that scanning it is a really costly affair)?
To search for a word in a large text, the Boyer-Moore algorithm is extensively used.
Principle: when starting a comparison at some position (index) in the text, if the character being compared does not appear anywhere in the word being searched for, there is no need to compare the other wordLength - 1 characters at all, and the index can jump by the whole word length. If the character does appear in the word, just shifted by a few characters, the comparison window can likewise be shifted by a few characters, and so on.
Edit: since you search from the end of the file, the first letter of the word (not the last) is compared first. E.g. when searching for "space" in "2001 a space odyssey", the word's 's' is first compared against the first 'y' of "odyssey"; the next comparison is the same 's' against the 'c' of "space" in the text.
Finally, to find the nth occurrence, a simple counter (initialized to n) is decremented each time the word is found; when it reaches 0, you are done.
The algorithm is easy to understand and to implement. Ideal for interviews.
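A minimal sketch of this reversed search in Java, using only the bad-character rule (the Horspool simplification of Boyer-Moore); the method name and the byte[]-based signature are illustrative assumptions:

```java
import java.util.Arrays;

public final class ReverseSearch {

    /**
     * Returns the offset of the n-th occurrence of 'word' in 'buf', counted
     * from the END of the buffer, or -1 if there are fewer than n occurrences.
     * Reverse Boyer-Moore-Horspool: bad-character rule only, mirrored so the
     * window slides from right to left.
     */
    public static int nthOccurrenceFromEnd(byte[] buf, byte[] word, int n) {
        int m = word.length;
        // shift[c]: how far the window may slide LEFT when the first character
        // of the window is c. Characters absent from word[1..m-1] allow a full
        // word-length jump, as described above.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = m - 1; i >= 1; i--) {
            shift[word[i] & 0xFF] = i;   // smallest i wins: nearest safe alignment
        }
        int left = buf.length - m;       // leftmost index of the current window
        while (left >= 0) {
            int i = 0;
            while (i < m && buf[left + i] == word[i]) i++;
            if (i == m) {                // full match
                if (--n == 0) return left;
                left--;                  // step one position left, keep counting
            } else {
                left -= shift[buf[left] & 0xFF];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "2001 a space odyssey".getBytes();
        byte[] word = "space".getBytes();
        System.out.println(nthOccurrenceFromEnd(text, word, 1));  // prints 7
    }
}
```

On the "2001 a space odyssey" example this compares 's' against 'y', jumps a full word length, compares 's' against 'c', shifts to align that 'c' with the word's 'c', and then finds the match at offset 7, just as described above.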
You may also ask whether the file is to be searched only once or several times. If it will be searched multiple times, you can suggest indexing the words of the file, i.e. building an in-memory structure that lets you quickly find out whether a word is present, where, how many times, etc. I like the trie for this: it is also easy to understand and very fast, though it can be fairly memory-hungry depending on the text. Lookup complexity is O(wordLength).
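As an illustration, a bare-bones position-recording trie in Java could look like this (class and method names are assumptions of the sketch; a real index would also deal with tokenization, case folding and memory limits):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Bare-bones word index: a trie mapping each word to the offsets where it occurs.
 * Lookup cost is O(wordLength), independent of how big the indexed file was.
 */
public final class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<Long> positions = new ArrayList<>();  // offsets at which the word starts
    }

    private final Node root = new Node();

    /** Record that 'word' occurs at byte offset 'position' in the file. */
    public void add(String word, long position) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.positions.add(position);
    }

    /** All offsets at which 'word' was recorded, in insertion order; empty if absent. */
    public List<Long> find(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return Collections.emptyList();
        }
        return cur.positions;
    }
}
```

Once the file has been tokenized into such an index in file order, the 10th occurrence of "error" from the end is simply the 10th-from-last element of find("error").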
--
When the interviewer mentions a "very large file", there are many factors to consider, such as:
...
There are two parts to the answer to this question. One is the algorithm, which can be any good string-search algorithm (Boyer-Moore, KMP, a trie, etc.). The other part is IO. Since you can't really read a file backwards, a good approach to speed things up is:

1. Get the file size and pick a block size that fits comfortably in memory.
2. Seek to the last block of the file.
3. Read that block from the file into a memory buffer.
4. Search the buffer backwards for the word, decrementing the occurrence counter; keep an overlap of wordLength - 1 bytes between blocks so a match spanning a block boundary is not missed.
5. Move one block towards the beginning of the file and repeat steps 3 and 4 until the 10th occurrence is found or the start of the file is reached.
This is a heavily IO-oriented question, and you can improve the approach with multiple threads or multiple machines, so that reading from file into memory and searching the buffer (steps 3 and 4) happen in parallel.
The same approach can be coded in C, C++ or Java.
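A rough Java sketch of the block-wise backward scan (the 1 MiB block size, the class/method names and the naive inner comparison are illustrative assumptions; the reversed Boyer-Moore sketch above could replace the inner matching loop):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public final class BackwardScan {

    /**
     * Scans the file backwards in fixed-size blocks and returns the byte offset
     * of the n-th occurrence of 'word' counted from the end, or -1 if there are
     * fewer than n occurrences. Assumes a byte-oriented pattern (e.g. ASCII "error").
     */
    public static long findNthFromEnd(String path, byte[] word, int n) throws IOException {
        final int BLOCK = 1 << 20;                 // 1 MiB per read; tune to the machine
        final int m = word.length;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileSize = file.length();
            long end = fileSize;                   // exclusive end of the region left to scan
            int remaining = n;
            while (end > 0) {
                long start = Math.max(0, end - BLOCK);
                // Read m - 1 extra bytes past 'end' so a match straddling the block
                // boundary is still seen; every match counted here STARTS before
                // 'end', so nothing is counted twice across blocks.
                int len = (int) (Math.min(end + m - 1, fileSize) - start);
                byte[] buf = new byte[len];
                file.seek(start);
                file.readFully(buf);
                for (int p = len - m; p >= 0; p--) {   // walk the block backwards
                    if (matchesAt(buf, p, word) && --remaining == 0) {
                        return start + p;              // offset of the n-th occurrence from the end
                    }
                }
                end = start;                           // move one block towards the file start
            }
        }
        return -1;
    }

    // Naive per-position comparison; the reverse Boyer-Moore sketch above could
    // replace this loop to skip over characters that cannot match.
    private static boolean matchesAt(byte[] buf, int p, byte[] word) {
        for (int i = 0; i < word.length; i++) {
            if (buf[p + i] != word[i]) return false;
        }
        return true;
    }
}
```

For the interview question, findNthFromEnd(path, "error".getBytes(), 10) returns the byte offset of the 10th "error" counted from the end, reading only as many blocks from the end of the file as necessary.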