I have a large binary file (many gigabytes, so loading it into memory is not an option) that I want to search for all occurrences of the string "icpf".
I tried using std::search
for this, but just got bitten by the fact that std::search
only works for forward iterators, not input iterators.
Does the standard library provide a fast alternative for this? Or do I need to hand-code the search (either reading in chunks at a time then std::search
on those, or ignore
everything until an 'i' and then manually check the next three characters)?
InputStream. read() method reads the next byte of the data from the the input stream and returns int in the range of 0 to 255.
The BufferedReader can't read the InputStream directly; So, we need to use an adapter like InputStreamReader to convert bytes to characters format. For example: // BufferedReader -> InputStreamReader -> InputStream BufferedReader br = new BufferedReader( new InputStreamReader(inputStream, StandardCharsets. UTF_8));
Does the standard library provide a fast alternative for this?
Although the standard C++ library offers ways to search text streams, it does not offer comparable algorithms for binary streams.
Or do I need to hand-code the search (either reading in chunks at a time then
std::search
on those, or ignore everything until an'i'
and then manually check the next three characters)?
Coding the "skip and search" approach could be tricky, because it is easy to code a solution that skips entries. For example, if you are looking for "icpf"
in a file containing "icpicpf"
, a simple program that processes one character at a time will fail to find "icpf"
suffix after discarding "icpi"
prefix.
If you are going to code this yourself, consider implementing Knuth–Morris–Pratt algorithm. There are many implementations available online, and it operates correctly on streams, because it considers one character at a time, and never goes back.
The fastest method is to load the entire file into memory, then search the memory.
The next best alternative is to keep the hard drive in motion. Perhaps have one thread that reads chunks of data into a buffer and another thread that searches the buffer.
Going down the list, reading in large chunks of data into a buffer, then searching the buffer is a good technique, although not as efficient as the previous methods.
You could read line by line, using std::getline
and std::string
. This is not as fast as block reading because the input function is searching for the newline character (and allocating memory in the std::string
).
The worst case is probably reading character by character. The function overhead is bad for reading a single characters (usually the overhead is the same for reading a large block of data).
No, there is no standard C++ library function for searching files. Some Operating Systems have utilities for searching files; perhaps you can use one of those.
Edit 1:
The bottleneck is inputting the data. Once you get the data into a buffer, then are many efficient search algorithms rather than the brute force (searching for the first letter, then searching for the next letters, etc.).
Search the internet for "string search algorithm".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With