Efficient way to search a stream for a string

Tags:

java

string

algorithm

stream

search

Let's suppose that I have a stream of text (or a Reader in Java) that I'd like to check for a particular string. The stream of text might be very large, so as soon as the search string is found I'd like to return true, and I'd also like to avoid storing the entire input in memory.

Naively, I might try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    while((numCharsRead = reader.read(buffer)) > 0) {
        if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
            return true;
    }
    return false;
}

Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"

How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

Edit: Note that when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid network or disk latency) and to keep memory usage constant regardless of the amount of data in the stream. The efficiency of the string-matching algorithm itself is secondary, but it would obviously be nice to find a solution that uses one of the more efficient algorithms.


asked May 10 '09 21:05

Alex Spurling

1 Answer

There are three good solutions here:

1. If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterministic finite-state machine. Your state will be a list of indices into the string you are searching for, and your logic looks something like this (pseudocode):

   String needle;
   n = needle.length();

   for every input character c do
     add index 0 to the list
     for every index i in the list do
       if c == needle[i] then
         if i + 1 == n then
           return true
         else
           replace i in the list with i + 1
         end
       else
         remove i from the list
       end
     end
   end

   This will find the string if it exists and you will never need a buffer.

2. Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.

3. If you want something faster still, create a buffer whose size is at least twice n, and use Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
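The first two approaches above can be sketched in Java roughly as follows. This is only an illustration under a few assumptions: the class and method names (StreamSearch, containsNfa, containsDfa) are made up, an empty search string is treated as always found, and the reader is wrapped in a BufferedReader so that reading one character at a time does not turn into one read on the underlying stream per character. The second method does not perform an explicit powerset construction; it uses the Knuth-Morris-Pratt failure table instead, which for a single search string should give an equivalent deterministic machine whose state is the length of the longest prefix of needle matched so far.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative, not from the answer.
final class StreamSearch {

    // Approach 1: simulate the NFA directly.
    // The state is the list of indices into needle that are still "alive".
    static boolean containsNfa(Reader reader, String needle) throws IOException {
        int n = needle.length();
        if (n == 0) return true; // assumption: empty needle always matches
        Reader in = new BufferedReader(reader); // batches reads on the underlying stream
        List<Integer> active = new ArrayList<>();
        int c;
        while ((c = in.read()) != -1) {
            active.add(0); // a new match may start at this character
            List<Integer> next = new ArrayList<>();
            for (int i : active) {
                if (needle.charAt(i) == (char) c) {
                    if (i + 1 == n) return true; // whole needle matched
                    next.add(i + 1);             // advance this partial match
                }
                // on a mismatch the index is simply not carried over
            }
            active = next;
        }
        return false;
    }

    // Approach 2 (swapped-in technique): a deterministic machine driven by
    // the Knuth-Morris-Pratt failure table. The single integer state is the
    // number of needle characters currently matched.
    static boolean containsDfa(Reader reader, String needle) throws IOException {
        int n = needle.length();
        if (n == 0) return true; // assumption: empty needle always matches
        // fail[i] = length of the longest proper prefix of needle[0..i]
        // that is also a suffix of it
        int[] fail = new int[n];
        for (int i = 1, k = 0; i < n; i++) {
            while (k > 0 && needle.charAt(i) != needle.charAt(k)) k = fail[k - 1];
            if (needle.charAt(i) == needle.charAt(k)) k++;
            fail[i] = k;
        }
        Reader in = new BufferedReader(reader);
        int state = 0;
        int c;
        while ((c = in.read()) != -1) {
            // fall back to shorter partial matches until c fits or none is left
            while (state > 0 && needle.charAt(state) != (char) c) state = fail[state - 1];
            if (needle.charAt(state) == (char) c) state++;
            if (state == n) return true;
        }
        return false;
    }
}
```

Both methods keep memory bounded by the length of needle (plus the BufferedReader's fixed internal buffer) rather than by the length of the stream, and both return as soon as the first match is complete. The third, Boyer-Moore approach is not sketched here because it needs the sliding buffer described above.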


answered Sep 20 '22 05:09

Norman Ramsey
