Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException { char[] buffer = new char[1024]; int numCharsRead; while((numCharsRead = reader.read(buffer)) > 0) { if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0) return true; } return false; }
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
Search across Stream (Classic)Type in a word or phrase into the Search box at the top of Microsoft Stream. Press enter or click the magnifying glass.
read() method reads the next byte of the data from the the input stream and returns int in the range of 0 to 255.
Strings are arrays of characters used to hold data like "Hi I am a string." A Stream is an i/o class that is used to read and write bytes of data as a continuous sequence of bytes. You can use streams to read and write data from/to files, among other things.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):
String needle; n = needle.length(); for every input character c do add index 0 to the list for every index i in the list do if c == needle[i] then if i + 1 == n then return true else replace i in the list with i + 1 end else remove i from the list end end end
This will find the string if it exists and you will never need a buffer.
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n
, and user Boyer-Moore to compile a state machine from needle
. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With