
How to verify if a file is readable by humans?

Tags: java, file

How can I make sure that a file is readable by humans?

By that I essentially mean checking whether the file is a txt, a yml, a doc, a json file and so on.

The issue is that, in the case where I want to perform this check, file extensions are misleading: a plain text file (that should be .txt) has an extension of .d, and there are various others :-(

What is the best way to verify that a file can be read by humans?

So far I have tried my luck with extensions, as follows:

// Returns true only for a fixed list of known text-like extensions.
private boolean humansCanRead(String extension) {
    switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
    }
}

But as I said, the extensions are not as expected.

EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I also don't really mind whether the text in the file makes sense, for example if it is encoded; I don't care about that at this point.

Thanks so far for all the responses! :D

fill͡pant͡ asked Jun 22 '15 09:06




2 Answers

In general, you cannot do that. You could use a language identification algorithm to guess whether a given text is written in a language that humans speak. Since your examples include formal languages like HTML, however, you are in deep trouble. If you really want to implement your check for (a finite set of) formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": for example, do you include Base64?

Edit: In case you are only interested in the character set, see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
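The character-set check described above could be sketched as follows, using only the standard library. This is a minimal sketch, not a definitive implementation: the class and method names (`Utf8Check`, `isValidUtf8`) are made up for illustration, and it assumes UTF-8 is the only encoding you count as readable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the name and shape are illustrative only.
final class Utf8Check {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting replacement characters.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The bytes could come from `Files.readAllBytes(path)`. Two caveats: if you pass only a prefix of the file, a multi-byte character cut off at the end will be reported as invalid, and bytes such as NUL are valid UTF-8, so a binary file can still pass this check.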

choeger answered Oct 28 '22 12:10


For some files, a check on the proportion of bytes in the printable ASCII range will help. If more than 75% of the first few hundred bytes are in that range, then the file is probably 'readable'.

Some files have headers or signatures, like the various forms of BOM on UTF files, the 0xA5EC which starts MS doc files or the "MZ" signature at the start of .exe files, which will tell you if the file is readable or not.

A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.
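A rough sketch of that proportion check is below. The names (`AsciiRatio`, `looksPrintable`) are hypothetical, and two choices are my own assumptions rather than part of the answer: tab, line feed and carriage return are counted as printable, and an empty sample is treated as not readable.

```java
// Hypothetical helper: the name and parameters are illustrative only.
final class AsciiRatio {
    // Returns true if more than `threshold` (e.g. 0.75) of the first
    // `sampleSize` bytes are printable ASCII or common whitespace.
    static boolean looksPrintable(byte[] data, int sampleSize, double threshold) {
        int n = Math.min(data.length, sampleSize);
        if (n == 0) {
            return false; // assumption: nothing to inspect means "not readable"
        }
        int printable = 0;
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            boolean visible = b >= 0x20 && b <= 0x7E;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (visible || whitespace) {
                printable++;
            }
        }
        return (double) printable / n > threshold;
    }
}
```

A call such as `AsciiRatio.looksPrintable(bytes, 512, 0.75)` would apply the 75% rule to the first 512 bytes. Text that is mostly non-ASCII (for example Greek or Chinese in UTF-8) will fail this check, which is why the answer says it only helps for some files.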
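As an illustration of the signature idea, here is a small sketch that looks only at the Unicode BOMs and the "MZ" marker; other signatures would be added the same way. The class, enum and method names are hypothetical, and the three-way verdict is an assumption of this sketch, not something the answer prescribes.

```java
// Hypothetical helper: names and the Verdict enum are illustrative only.
final class SignatureCheck {
    enum Verdict { TEXT, BINARY, UNKNOWN }

    // Byte order marks for UTF-8, UTF-32 (LE, BE) and UTF-16 (LE, BE).
    private static final int[][] BOMS = {
        {0xEF, 0xBB, 0xBF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE},
        {0xFE, 0xFF},
    };

    // "MZ", the marker at the start of .exe files.
    private static final int[] MZ = {0x4D, 0x5A};

    // Classifies a file by the first few bytes of its content.
    static Verdict classify(byte[] head) {
        for (int[] bom : BOMS) {
            if (startsWith(head, bom)) {
                return Verdict.TEXT;
            }
        }
        if (startsWith(head, MZ)) {
            return Verdict.BINARY;
        }
        return Verdict.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this is only a heuristic: a plain text file that happens to begin with the letters "MZ" would be classified as binary, so an `UNKNOWN` or `BINARY` verdict is best combined with other checks.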

Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
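One possible way to structure this, sketched under my own assumptions: read at most the first kilobyte, then run an ordered list of checks, each of which either returns a verdict or passes. The names `HeadChecks`, `readHead` and `firstVerdict` are hypothetical, and the individual checks are supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: names and structure are illustrative only.
final class HeadChecks {
    // Reads up to `limit` bytes from the start of the file.
    static byte[] readHead(Path file, int limit) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends (read returns -1).
            while (total < limit && (n = in.read(buf, total, limit - total)) > 0) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    // Runs the checks in order; the first one with an opinion wins.
    // Optional.empty() from a check means "cannot tell, try the next one".
    static Optional<Boolean> firstVerdict(
            byte[] head, List<Function<byte[], Optional<Boolean>>> checks) {
        for (Function<byte[], Optional<Boolean>> check : checks) {
            Optional<Boolean> verdict = check.apply(head);
            if (verdict.isPresent()) {
                return verdict;
            }
        }
        return Optional.empty();
    }
}
```

Because the list is ordered, you can move the checks for the most common formats to the front once you have data on which formats actually occur.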

rossum answered Oct 28 '22 12:10