Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching for a string in an input stream

Tags:

c++

iostream

I have a large binary file (many gigabytes, so loading it into memory is not an option) that I want to search for all occurrences of the string "icpf".

I tried using std::search for this, but just got bitten by the fact that std::search only works for forward iterators, not input iterators.

Does the standard library provide a fast alternative for this? Or do I need to hand-code the search (either reading in chunks at a time then std::search on those, or ignore everything until an 'i' and then manually check the next three characters)?

like image 605
zennehoy Avatar asked Feb 22 '16 17:02

zennehoy


People also ask

Which method is used to read a string from the input stream?

InputStream. read() method reads the next byte of the data from the the input stream and returns int in the range of 0 to 255.

How do I get BufferedReader from InputStream?

The BufferedReader can't read the InputStream directly; So, we need to use an adapter like InputStreamReader to convert bytes to characters format. For example: // BufferedReader -> InputStreamReader -> InputStream BufferedReader br = new BufferedReader( new InputStreamReader(inputStream, StandardCharsets. UTF_8));


2 Answers

Does the standard library provide a fast alternative for this?

Although the standard C++ library offers ways to search text streams, it does not offer comparable algorithms for binary streams.

Or do I need to hand-code the search (either reading in chunks at a time then std::search on those, or ignore everything until an 'i' and then manually check the next three characters)?

Coding the "skip and search" approach could be tricky, because it is easy to code a solution that skips entries. For example, if you are looking for "icpf" in a file containing "icpicpf", a simple program that processes one character at a time will fail to find "icpf" suffix after discarding "icpi" prefix.

If you are going to code this yourself, consider implementing Knuth–Morris–Pratt algorithm. There are many implementations available online, and it operates correctly on streams, because it considers one character at a time, and never goes back.

like image 85
Sergey Kalinichenko Avatar answered Sep 23 '22 20:09

Sergey Kalinichenko


The fastest method is to load the entire file into memory, then search the memory.

The next best alternative is to keep the hard drive in motion. Perhaps have one thread that reads chunks of data into a buffer and another thread that searches the buffer.

Going down the list, reading in large chunks of data into a buffer, then searching the buffer is a good technique, although not as efficient as the previous methods.

You could read line by line, using std::getline and std::string. This is not as fast as block reading because the input function is searching for the newline character (and allocating memory in the std::string).

The worst case is probably reading character by character. The function overhead is bad for reading a single characters (usually the overhead is the same for reading a large block of data).

No, there is no standard C++ library function for searching files. Some Operating Systems have utilities for searching files; perhaps you can use one of those.

Edit 1:
The bottleneck is inputting the data. Once you get the data into a buffer, then are many efficient search algorithms rather than the brute force (searching for the first letter, then searching for the next letters, etc.).

Search the internet for "string search algorithm".

like image 36
Thomas Matthews Avatar answered Sep 26 '22 20:09

Thomas Matthews