Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to search a stream for a string

Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.

Naively, I might try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException {     char[] buffer = new char[1024];     int numCharsRead;     while((numCharsRead = reader.read(buffer)) > 0) {         if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)             return true;     }     return false; } 

Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"

How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.

like image 658
Alex Spurling Avatar asked May 10 '09 21:05

Alex Spurling


People also ask

How do I search in stream?

Search across Stream (Classic)Type in a word or phrase into the Search box at the top of Microsoft Stream. Press enter or click the magnifying glass.

Which method is used to read a string from the input stream?

read() method reads the next byte of the data from the the input stream and returns int in the range of 0 to 255.

What is difference between stream and string?

Strings are arrays of characters used to hold data like "Hi I am a string." A Stream is an i/o class that is used to read and write bytes of data as a continuous sequence of bytes. You can use streams to read and write data from/to files, among other things.


1 Answers

There are three good solutions here:

  1. If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):

    String needle; n = needle.length();  for every input character c do   add index 0 to the list   for every index i in the list do     if c == needle[i] then       if i + 1 == n then         return true       else         replace i in the list with i + 1       end     else       remove i from the list     end   end end 

    This will find the string if it exists and you will never need a buffer.

  2. Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.

  3. If you want something faster, create a buffer whose size is at least twice n, and user Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.

like image 122
Norman Ramsey Avatar answered Sep 20 '22 05:09

Norman Ramsey