How can I check whether a file is binary or text without opening it?
File extensions: We can usually tell whether a file is binary or text from its file extension. By convention the extension reflects the file format, and it is ultimately the file format that dictates whether the file data is binary or text. The extension is only a naming convention, though, so it can be wrong.
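As a rough illustration of the extension approach, here is a minimal Python sketch that guesses from the file name alone, using the standard library's mimetypes module. The helper name looks_textual_by_extension is made up for this example, and treating only "text/..." MIME types as text is a simplifying assumption.

```python
import mimetypes
from typing import Optional


def looks_textual_by_extension(path: str) -> Optional[bool]:
    """Guess text vs. binary from the file name alone; the file is never opened."""
    mime_type, encoding = mimetypes.guess_type(path)
    if encoding is not None:
        # A compression suffix such as .gz means the bytes on disk are binary.
        return False
    if mime_type is None:
        # Unknown or missing extension: there is no basis for a guess.
        return None
    # Simplifying assumption: only "text/..." types count as text.
    return mime_type.startswith("text/")


for name in ("notes.txt", "file.jpg", "README"):
    print(name, looks_textual_by_extension(name))
```

This is only as reliable as the name. A file called file.jpg that actually contains plain text would still be reported as binary, and textual formats registered under "application/..." (JSON, for instance) would need to be added by hand.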
Text files are organized around lines, each of which ends with a newline character ('\n'). Source code files are themselves text files. A binary file is one in which data is stored in the same form in which it is held in main memory for processing.
Some SQL files begin with a byte order mark (BOM); this is the 0xFEFF sequence you're seeing at the start of the file. A BOM of this form usually means the file is encoded as UTF-16, and git then treats the file as binary, not text, so any operation that depends on git's text handling, such as generating a diff, isn't going to return what you expect.
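To see whether a file is affected, you can look at its first bytes. The sketch below is only an approximation: the function names are invented for this example, and the idea that a NUL byte near the start is what makes a tool classify the file as binary, along with the 8000-byte window, is an assumption about the heuristic rather than a statement of git's exact behaviour. It does need the leading bytes of the file, so the file has to be opened to obtain them.

```python
import codecs
from typing import Optional

# BOM byte sequences from the standard library's codecs module.
# UTF-32 is checked first because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def has_nul_byte(head: bytes, limit: int = 8000) -> bool:
    """Report whether a NUL byte occurs within the first `limit` bytes (assumed window)."""
    return b"\x00" in head[:limit]


# The same statement stored as UTF-16 with a BOM: every ASCII character gets a NUL partner.
sample = codecs.BOM_UTF16_LE + "SELECT 1;".encode("utf-16-le")
print(detect_bom(sample), has_nul_byte(sample))
```

A UTF-8 file with a BOM contains no NUL bytes, so under this assumption it would not be flagged. That suggests the UTF-16 encoding, rather than the BOM itself, is the part that matters.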
An HTML file is a text file too, even though it contains many characters that are not displayed when it is viewed in a browser. It is considered a text file even though a newline, as described above, won't cause the next character to be displayed on the next line when the file is viewed in a browser.
Schrödinger's cat, I'm afraid.
There is no way to determine the contents of a file without opening it. The filesystem stores no metadata relating to the contents.
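For comparison, this is roughly everything you can learn without opening the file: what stat reports. The Python sketch below uses os.lstat; the helper name describe_without_opening and the choice of fields are just for illustration.

```python
import os
import stat
import tempfile


def describe_without_opening(path: str) -> dict:
    """Collect what the filesystem knows about a path; no file contents are read."""
    st = os.lstat(path)
    return {
        "is_regular_file": stat.S_ISREG(st.st_mode),
        "size_bytes": st.st_size,
        "permissions": stat.filemode(st.st_mode),
    }


# Demo on a throwaway file (writing it is only needed to have something to inspect).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "file.jpg")
    with open(demo_path, "w", newline="") as fh:
        fh.write("This is not a pipe\n")
    print(describe_without_opening(demo_path))
```

Type, size, and permissions come back, but nothing in the result says whether the bytes inside are text.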
If avoiding opening the file is not a hard requirement, then there are a number of solutions available to you.
Edit:
It has been suggested in a number of comments and answers that file(1) is a good way of determining the contents. Indeed it is. However, file(1) opens the file, which was prohibited in the question. See the penultimate line in the following example:
> echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
open("file.jpg", O_RDONLY|O_LARGEFILE) = 3
write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text
The correct way to determine the type of a file is to use the file(1) command.
You also need to be aware that UTF-8 encoded files are "text" files, but may contain non-ASCII data. Other encodings also have this issue. In the case of text encoded with a code page, it may not be possible to unambiguously determine if a file is text or not.
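A small Python sketch of why code pages are ambiguous. The function name decodes_as_utf8 is invented for this example. UTF-8 has structure that can be validated, whereas a single-byte encoding such as Latin-1 accepts any byte sequence, so a successful decode proves nothing.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 (pure ASCII also qualifies)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "Schrödinger's cat"
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

print(decodes_as_utf8(as_utf8))    # valid UTF-8
print(decodes_as_utf8(as_latin1))  # the lone 0xF6 byte is not valid UTF-8
# Latin-1 never raises, even on the UTF-8 bytes; it just produces the wrong characters.
print(as_utf8.decode("latin-1"))
```

If you only test a prefix of a file, keep in mind that a multi-byte character cut off at the end of the prefix will also fail to decode.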
The file(1) command will look at the structure of a file to try and determine what it contains - from the file(1) man page:
The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually ‘binary’ or non-printable).
With regard to different character encodings, the file(1) man page has this to say:
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ‘text’ because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ‘character data’ because, while they contain text, it is text that will require translation before it can be read.
So, some text will be identified as text, but some may be identified as character data. You will need to determine yourself if this matters to your application and take appropriate action.
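If you want to make that decision in a program, one option is to call file(1) and act on what it reports. The Python sketch below is a minimal example, assuming a file(1) implementation that supports the --brief and --mime-encoding options; the wrapper name file_mime_encoding is made up, and it returns None when no file command is on the PATH.

```python
import shutil
import subprocess
from typing import Optional


def file_mime_encoding(path: str) -> Optional[str]:
    """Ask file(1) for the character encoding it detects, or None if file(1) is missing."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # "--" stops option parsing so that odd file names are not read as flags.
        ["file", "--brief", "--mime-encoding", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For a plain ASCII file this typically prints us-ascii, and for data it cannot interpret as text it reports binary. Whether an encoding such as UTF-16 counts as "text" for your purposes is still your call, as noted above.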