The strings command prints the strings of printable characters in a binary file.
I am curious to know how it works on a high level.
It shouldn't be straightforward since each binary file has a different format, from executables to PDFs and others. So each byte can mean different things from an ASCII/Unicode character to other metadata.
So does it know all those binary file formats? If so, it would not be able to work with a new or custom binary format.
UPDATE: I know what the strings command does. I just want to know how it does what it does.
strings does not attempt to parse all kinds of files. It scans any file for a long enough sequence of 'printable characters', and when found, shows it. See? No "parsing" involved. (With one exception.)
.. So each byte can mean different things from an ASCII/Unicode character to other metadata.
Only up to a certain point. strings is very straightforward, as it does not attempt to 'parse' for meanings. That is, it does not see the difference between a text string "Hello world" and any random binary sequence that happens to contain the bytes 0x48, 0x65, 0x6C, 0x6C, 0x6F (etc.) in that particular order.
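To make the "no parsing" point concrete, here is a minimal sketch of such a scan in Python. It is not the actual GNU implementation: the function name extract_strings, the exact set of bytes treated as printable (0x20 to 0x7E plus tab), and the flush at end of input are assumptions made for illustration. The default minimum length of 4 matches the man page.

```python
def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII bytes that are at least min_len long."""
    found = []
    run = bytearray()
    for byte in data:
        # Assumption: "printable" means space (0x20) through tilde (0x7E), plus tab.
        if 0x20 <= byte <= 0x7E or byte == 0x09:
            run.append(byte)
            continue
        # A non-printable byte ends the current run; keep it only if long enough.
        if len(run) >= min_len:
            found.append(run.decode("ascii"))
        run.clear()
    # Flush a run that reaches the end of the data.
    if len(run) >= min_len:
        found.append(run.decode("ascii"))
    return found


blob = b"\x00\x01Hello world\x00\xff\x7fabc\x00\x90GCC: 1.0\x00"
print(extract_strings(blob))  # ['Hello world', 'GCC: 1.0']
```

Note that "abc" is dropped only because it is shorter than four characters. The scan has no idea whether the bytes were meant as text or just happen to look like it.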
The only allowance it has is you can tell it to (attempt to) interpret the raw bytes as a different character set:
-e encoding
--encoding=encoding
Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859, etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful for finding wide character strings.
(http://unixhelp.ed.ac.uk/CGI/man-cgi?strings)
Again, it merely does what you told it to: when told to look for 7-bit ASCII only, it skips high-ASCII characters (even though these may appear in "valid text" inside the binary), and when told that 8-bit is okay as well, it shows accented characters along with random stuff such as ¿, ¼, ¢ and ².
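The effect of the encoding switch can be sketched the same way. This is a rough Python illustration, not the real code: the function name extract_strings_enc is hypothetical, only the s, S and l values are handled, high bytes are shown as Latin-1 characters, and retrying at every byte offset is a choice made for this sketch.

```python
def extract_strings_enc(data: bytes, encoding: str = "s", min_len: int = 4) -> list[str]:
    """Sketch of the -e idea for 's' (7-bit), 'S' (8-bit) and 'l' (16-bit little-endian)."""
    width = 2 if encoding == "l" else 1
    found, run = [], []

    def flush():
        # Keep the current run only if it is long enough, then start over.
        if len(run) >= min_len:
            found.append("".join(run))
        run.clear()

    i = 0
    while i + width <= len(data):
        code = int.from_bytes(data[i:i + width], "little")
        printable = 0x20 <= code <= 0x7E or code == 0x09
        if encoding == "S" and code >= 0x80:
            printable = True  # 8-bit mode: high bytes count as characters too
        if printable:
            run.append(chr(code))  # assumption: high bytes are displayed as Latin-1
            i += width
        else:
            flush()
            i += 1  # retry at the next byte, so unaligned wide strings are still found
    flush()
    return found


data = b"\x01caf\xe9 au lait\x01\x01" + "wide".encode("utf-16-le") + b"\x01\x01"
print(extract_strings_enc(data, "s"))  # [' au lait']
print(extract_strings_enc(data, "S"))  # ['café au lait']
print(extract_strings_enc(data, "l"))  # ['wide']
```

The same bytes give three different answers depending on what you ask for. In 7-bit mode the é cuts "café au lait" in two and the first half is too short to be shown.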
As to parsing, you can infer from the man page there is a single exception:
-a
--all
Do not scan only the initialized and loaded sections of object files; scan the whole files.
where this "object file" is an executable type that your system supports. This may be purely pragmatic: executable binary headers are easily recognized and parsed (an example for "ELF" on SO itself), and one is mostly interested in the text stored in the executable/data part of a binary, not in the more random bytes in its headers and relocation tables.
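As a rough illustration of how cheap that recognition is, the sketch below checks the first bytes of a file for the ELF identification block. The function name describe_elf_ident is hypothetical, and this only looks at the magic number, the class byte and the data-encoding byte; it does not locate sections, which is what would be needed to restrict the scan to loaded data.

```python
def describe_elf_ident(data: bytes):
    """Return (word_size, byte_order) if data starts with an ELF header, else None."""
    # Every ELF file begins with the four magic bytes 0x7F 'E' 'L' 'F'.
    if len(data) < 6 or data[:4] != b"\x7fELF":
        return None
    # Byte 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    word_size = {1: "32-bit", 2: "64-bit"}.get(data[4], "unknown")
    # Byte 5 (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = {1: "little-endian", 2: "big-endian"}.get(data[5], "unknown")
    return word_size, byte_order


header = b"\x7fELF\x02\x01\x01" + bytes(9)
print(describe_elf_ident(header))          # ('64-bit', 'little-endian')
print(describe_elf_ident(b"%PDF-1.7\n"))   # None
```

Anything that does not start with a header it recognizes, such as the PDF bytes above, would simply be scanned from start to end.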
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.
strings is mainly useful for determining the contents of non-text files.