In C++ we have a method to search for text in a file. It works by reading the file into a variable and using strstr. But we ran into trouble when the file got very large.
I thought I could solve this by calling find.exe using _popen. It works fine, except when these conditions are all true:
To recreate, you can do this:
I also tried this:
Is this a bug, or is there something I'm missing?
Very interesting bug.
This question caused me to do some experiments on XP and Win 7 - the behaviors are different.
XP
ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th. The matching line printout is truncated after 1023 characters.
Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th. The matching line printout is truncated after 1024 characters.
I find it very odd that the line limits for Unicode and ANSI on XP are not the same number of bytes, nor is one a simple multiple of the other. The Unicode limit expressed as bytes is 2 times (the ANSI limit plus 1), that is 2 × (1023 + 1) = 2048.
Note: truncation of matching long lines also truncates the new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.
Windows 7
ANSI - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line; it just can't display all of them.
Unicode - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line; it just can't display all of them.
Since Unicode byte lengths are always a multiple of 2, and the max ANSI displayable length is an odd number, it makes sense that the max displayable line length in bytes is one less for Unicode than for ANSI.
But then there is also the weird Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be searched or printed. It does not matter if the file contains a single line or multiple lines. It only depends on the total file length.
I find it interesting that the multiple of 4096 bug is within one of the max printable line length (in bytes). But I don't know if there is a relationship between those behaviors or if it is simply coincidence.
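If you still want to call FIND, one way to guard against the multiple-of-4096 case is to check the file size first. This is only a rough sketch based on the behavior I observed: the helper name and the constant are made up for illustration, and treating an empty file as safe is my own assumption.

```cpp
#include <cstdint>

// Block size at which the Unicode problem was observed on Windows 7.
// Taken from the experiments above; not a documented FIND constant.
constexpr std::uint64_t kFindBlockBytes = 4096;

// Hypothetical helper: returns true when a Unicode file of this size falls
// into the "exact multiple of 4096 bytes" case, where FIND could not search
// or print the last character.
bool hits_find_unicode_edge_case(std::uint64_t file_size_bytes) {
    // Assumption: an empty file has no last character to lose.
    return file_size_bytes != 0 && file_size_bytes % kFindBlockBytes == 0;
}
```

The size could come from something like std::filesystem::file_size (C++17). When the check returns true, the caller could fall back to another search method rather than trust FIND's result.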
Note: truncation of matching long lines also truncates any new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.