
Bug with find.exe?

In C++ we have a function that searches for text in a file. It reads the whole file into a buffer and calls strstr on it. That got us into trouble once the files became very large.

I thought I could solve this by calling find.exe using _popen. It works fine, except when these conditions are all true:

  • The file is saved as "Unicode" (UTF-16 LE, BOM = FF FE)
  • The file is EXACTLY 4096 bytes
  • The text you are searching for is the last text in the file
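For context, here is a minimal sketch of what calling an external command through _popen and collecting its output might look like. The helper name runAndCapture and the POPEN_FN/PCLOSE_FN macros are made up for illustration and are not the code from the question; the macros only exist so the same sketch also compiles where the function is spelled popen.

```cpp
#include <stdio.h>
#include <stdexcept>
#include <string>

// _popen/_pclose are the MSVC spellings; elsewhere the functions are popen/pclose.
#ifdef _WIN32
#define POPEN_FN _popen
#define PCLOSE_FN _pclose
#else
#define POPEN_FN popen
#define PCLOSE_FN pclose
#endif

// Hypothetical helper: runs `command` through the shell and returns
// everything the command wrote to its standard output.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = POPEN_FN(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("could not start: " + command);
    }
    std::string output;
    char buffer[256];
    // fgets returns a null pointer once the command closes its output.
    while (fgets(buffer, sizeof buffer, pipe) != nullptr) {
        output += buffer;
    }
    PCLOSE_FN(pipe);
    return output;
}
```

On Windows this could be called as runAndCapture("find /c \"A\" test.txt"), with the caller inspecting the returned text for the count that find prints.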

To recreate, you can do this:

  1. Open notepad
  2. Insert 2046 X's then an A at the end
  3. Save as test.txt, encoding = "unicode"
  4. Verify that the file is exactly 4096 bytes
  5. Open a command prompt and type: find "A" /c test.txt -> No hits
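If you would rather not count X's in Notepad, the test file can also be generated. The sketch below assumes Notepad's "Unicode" encoding means UTF-16 little-endian with a 2-byte BOM (FF FE) and no trailing newline; the function name writeUtf16Repro is hypothetical. With the default of 2046 X's plus one A, the size works out to 2 + 2 × 2047 = 4096 bytes.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: writes a UTF-16 LE file consisting of the BOM (FF FE),
// `fillCount` 'X' characters and a final 'A'.
// Returns the number of bytes written, or 0 if the file could not be written.
std::size_t writeUtf16Repro(const std::string& path, std::size_t fillCount = 2046) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return 0;
    }
    std::string bytes;
    bytes += '\xFF';  // BOM, first byte
    bytes += '\xFE';  // BOM, second byte
    for (std::size_t i = 0; i < fillCount; ++i) {
        bytes += 'X';   // low byte of the UTF-16 LE code unit
        bytes += '\0';  // high byte
    }
    bytes += 'A';
    bytes += '\0';
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out ? bytes.size() : 0;
}
```

Calling writeUtf16Repro("test.txt") should give the 4096-byte case; passing 2045 or 2047 as the second argument gives the neighbouring sizes that do produce a hit.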

I also tried this:

  • Add or remove an X, and you will get a hit (file is not 4096 bytes anymore)
  • Save as UTF-8 (and add enough X's so that the file is 4096 bytes again), and you get a hit
  • Search for something in the middle of the file (file still unicode and 4096 bytes), and you get a hit.

Is this a bug, or is there something I'm missing?

Arve Hansen asked Apr 11 '13 08:04


1 Answer

Very interesting bug.

This question prompted me to run some experiments on XP and Windows 7 - the behaviors are different.

XP

ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th. The matching line printout is truncated after 1023 characters.

Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th. The matching line printout is truncated after 1024 characters.

I find it very odd that the line limits for Unicode and ANSI on XP are neither the same number of bytes nor a simple multiple of one another. Expressed in bytes, the Unicode limit (2048) is twice the ANSI limit plus 2, that is, 2 × (1023 + 1).

Note: truncation of matching long lines also truncates the new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.

Windows 7

ANSI - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line; it just can't display all of them.

Unicode - I have not found a limit to the max line length that can be searched (though I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line; it just can't display all of them.

Since Unicode byte lengths are always a multiple of 2, and the max ANSI displayable length is an odd number, it makes sense that the max displayable line length in bytes is one less for Unicode than for ANSI.

But then there is also the weird Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be searched or printed. It does not matter if the file contains a single line or multiple lines. It only depends on the total file length.
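A caller that shells out to FIND could check for this condition up front and fall back to another search method. This is only a sketch based on the observations above: it assumes a "Unicode" file can be recognised by the FF FE BOM from the question, and the function name mayHitFindUnicodeBug is hypothetical.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical check: returns true if the file starts with the UTF-16 LE BOM
// (FF FE) and its total size is a non-zero exact multiple of 4096 bytes,
// i.e. the combination for which FIND was observed to miss the last character.
bool mayHitFindUnicodeBug(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return false;  // unreadable file: nothing to report
    }
    const std::streamoff size = in.tellg();  // opened at end, so this is the size
    if (size <= 0 || size % 4096 != 0) {
        return false;
    }
    in.seekg(0);
    unsigned char bom[2] = {0, 0};
    in.read(reinterpret_cast<char*>(bom), 2);
    return in.gcount() == 2 && bom[0] == 0xFF && bom[1] == 0xFE;
}
```

The check looks only at the total file size, not at line lengths, which matches the observation that the number of lines does not matter.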

I find it interesting that the multiple of 4096 bug is within one of the max printable line length (in bytes). But I don't know if there is a relationship between those behaviors or if it is simply coincidence.

Note: truncation of matching long lines also truncates any new-line character, so the next matching line will appear to be appended to the previous line. You can tell it is a new line if you use the /N option.

dbenham answered Oct 07 '22 12:10