How can I verify whether a file is binary or text without opening it?


linux + verify if file is text or binary

2 Answers

Schrödinger's cat, I'm afraid.

There is no way to determine the contents of a file without opening it. The filesystem stores no metadata relating to the contents.
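To illustrate the point, here is a minimal Python sketch of what can be learned without opening a file. The helper name `describe_without_opening` is hypothetical; it relies only on `os.lstat`, which corresponds to the `lstat64` call visible in the strace output further down. Everything it returns describes the inode, not the bytes inside it.

```python
import os
import stat
import tempfile


def describe_without_opening(path):
    """Return what lstat() reveals about a path (hypothetical helper).

    None of these fields says whether the contents are text or binary.
    """
    st = os.lstat(path)
    return {
        "is_regular_file": stat.S_ISREG(st.st_mode),
        "permissions": oct(stat.S_IMODE(st.st_mode)),
        "size_bytes": st.st_size,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.jpg")
        # Creating the sample file needs open(); inspecting it below does not.
        with open(path, "wb") as handle:
            handle.write(b"This is not a pipe\n")
        print(describe_without_opening(path))
```

The size, mode and file type are available, but a misleading name such as `file.jpg` is all the filesystem offers as a hint about the contents.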

If avoiding opening the file is not a hard requirement, then there are a number of solutions available to you.
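One such solution, sketched here in Python under stated assumptions: read a sample from the start of the file and treat a NUL byte as a sign of binary data. The function name `looks_binary` and the 8192-byte sample size are arbitrary choices for illustration, and this is only a rough heuristic, not what any particular tool does.

```python
def looks_binary(path, sample_size=8192):
    """Heuristic: call a file binary if its first bytes contain a NUL.

    Hypothetical helper; sample_size is an arbitrary illustrative value.
    """
    # This opens the file, as any check of the contents must.
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    return b"\x00" in sample
```

It will misjudge some files. Text stored as UTF-16, for example, is full of NUL bytes, and a binary file may happen to contain none in its first few kilobytes.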

Edit:

It has been suggested in a number of comments and answers that file(1) is a good way of determining the contents. Indeed it is. However, file(1) opens the file, which was prohibited in the question. See the penultimate line in the following example:

> echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
open("file.jpg", O_RDONLY|O_LARGEFILE)  = 3
write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text

170

answered Sep 22 '22 12:09

Johnsyweb

The correct way to determine the type of a file is to use the file(1) command.
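If you need the answer from a program rather than at the prompt, a rough Python sketch is to run file(1) and read its verdict. The wrapper name `mime_encoding` is hypothetical. It assumes a file(1) implementation that supports the `--brief` and `--mime-encoding` options, which typically prints an encoding name such as `us-ascii` or `utf-8` for text and `binary` otherwise.

```python
import shutil
import subprocess


def mime_encoding(path):
    """Return file(1)'s encoding guess for path, or None if file(1) is missing.

    Hypothetical wrapper; assumes file(1) accepts --brief and --mime-encoding.
    """
    if shutil.which("file") is None:
        return None  # file(1) is not installed on this system
    result = subprocess.run(
        ["file", "--brief", "--mime-encoding", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A caller could then treat any result other than `binary` as text, keeping in mind the encoding caveats that follow.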

You also need to be aware that UTF-8 encoded files are "text" files, but may contain non-ASCII data. Other encodings also have this issue. In the case of text encoded with a code page, it may not be possible to unambiguously determine if a file is text or not.
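A small Python sketch of that ambiguity: the same four bytes are invalid as UTF-8 but decode to two different, equally plausible words under two single-byte code pages. The helper name `decodings` and the choice of `latin-1` and `cp1251` are illustrative assumptions only.

```python
def decodings(data, encodings=("utf-8", "latin-1", "cp1251")):
    """Map each encoding name to the decoded text, or None if decoding fails.

    Hypothetical helper for illustration.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # 0xE9 is a lone lead byte in UTF-8, but a letter in both code pages.
    print(decodings(b"caf\xe9"))
```

Nothing in the bytes themselves says which reading was intended, and because a single-byte code page can assign a character to almost any byte value, arbitrary binary data can pass as "text" in it.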

The file(1) command looks at the structure of a file to try to determine what it contains. From the file(1) man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually ‘binary’ or non-printable).

With regard to different character encodings, the file(1) man page has this to say:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ‘text’ because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ‘character data’ because, while they contain text, it is text that will require translation before it can be read.

So, some text will be identified as text, but some may be identified as character data. You will need to decide for yourself whether this matters to your application and take appropriate action.
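To make the distinction concrete, here is a deliberately simplified Python sketch that sorts a byte string into the three labels the man page uses. It is not file(1)'s algorithm: the function name `classify`, the set of allowed control characters, and the decision to recognise only UTF-8 and BOM-prefixed UTF-16 are all assumptions made for illustration.

```python
import codecs

# Control characters tolerated in "text" (an illustrative choice).
ALLOWED_CONTROLS = b"\t\n\r\f\b\x1b"


def classify(data):
    """Label bytes as 'text', 'character data' or 'data' (hypothetical helper)."""
    # UTF-16 with a byte-order mark: text, but it needs translation first.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "character data"
        except UnicodeDecodeError:
            return "data"
    # NUL and other unusual control bytes suggest non-text content.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return "data"
    try:
        data.decode("utf-8")  # plain ASCII is also valid UTF-8
    except UnicodeDecodeError:
        return "data"
    return "text"
```

An application that only wants to know whether it can display a file might accept 'text' alone, while one that is prepared to transcode could accept 'character data' as well.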

answered Sep 19 '22 12:09

camh


Tags:

linux

lidia
