How can I make sure that a file is readable by humans?
By that I essentially mean: I want to check whether the file is a txt, a yml, a doc, a json file, and so on.
The issue is that in the case where I want to perform this check, file extensions are misleading. By that I mean that a plain text file (which should be .txt) can have an extension of .d, and various others :-(
What is the best way to verify that a file can be read by humans?
So far I have tried my luck with extensions, as follows:
private boolean humansCanRead(String extension) {
    // Whitelist of extensions assumed to be human-readable.
    switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
    }
}
But as I said, the extensions are not what you would expect.
EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I don't really mind whether the text in the file makes sense; if it is encoded, I don't really care at this point.
Thanks so far for all the responses! :D
A human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans.
This shell one-liner will find any files (NOTE: it will not find symlinks, directories, sockets, etc., only regular files) in /dir/to/search and test each of them:

find /dir/to/search -type f -exec sh -c 'file -b {} | grep text &>/dev/null' \; -print

The sh -c 'file -b {} | grep text &>/dev/null' part looks at the type of each file and searches for "text" in the description.
You can call the shell command file -i ${filename} from Java and check the output to see whether it contains something like charset=binary. If it does, then it is a binary file; otherwise it is a text-based file.
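As a minimal sketch of that approach (assuming a Unix-like system where the file utility is on the PATH and supports -i, as GNU file on Linux does; the class and method names are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class FileTypeCheck {
    // Runs `file -i <filename>` and inspects the reported charset.
    // Binary files are reported as charset=binary.
    static boolean isTextFile(String filename) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("file", "-i", filename).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String output = r.readLine(); // e.g. "notes.txt: text/plain; charset=us-ascii"
            p.waitFor();
            return output != null && !output.contains("charset=binary");
        }
    }
}

Note that this relies on an external file binary, so it does not meet the platform-independence requirement in the question.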
In general, you cannot do that. You could use a language-identification algorithm to guess whether a given text is one that could be spoken by humans. Since your examples include formal languages like HTML, however, you are in some deep trouble.

If you really want to implement your check for a (finite) set of formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": e.g., do you include Base64?
edit: In case you are only interested in the character set, see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
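A sketch of that check for UTF-8, using only the standard library (the class and method names are illustrative): a strict CharsetDecoder throws if the bytes are not valid UTF-8.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodingCheck {
    // Returns true if the whole file decodes as valid UTF-8.
    static boolean isValidUtf8(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}

Be aware that perfectly readable text in a single-byte encoding such as ISO-8859-1 can fail this check, so you may want to try several encodings in turn.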
For some files, a check on the proportion of bytes in the printable ASCII range will help: if more than 75% of the first few hundred bytes fall in that range, the file is probably "readable".
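A sketch of that heuristic, with the sample size and the 75% threshold taken from the paragraph above (the names are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PrintableRatioCheck {
    // Reads the first few hundred bytes and checks what proportion falls in
    // the printable ASCII range (plus common whitespace).
    static boolean looksReadable(String filename) throws IOException {
        byte[] buf = new byte[512];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(filename))) {
            n = in.read(buf);
        }
        if (n <= 0) return false; // empty file: nothing to judge
        int printable = 0;
        for (int i = 0; i < n; i++) {
            byte b = buf[i];
            // 0x20..0x7E plus tab, newline, carriage return.
            if ((b >= 0x20 && b < 0x7F) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        return printable * 100 / n > 75;
    }
}

Since bytes outside ASCII count against the ratio, this will reject non-Latin text on its own; it fits the "characters of any language" requirement only when combined with an encoding check like the one above.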
Some files have headers, like the various forms of BOM on UTF files, the 0xA5EC signature which starts MS Word .doc content, or the "MZ" signature at the start of .exe files, which will tell you whether the file is readable or not.
A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.
Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
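As a sketch of those signature checks (the UTF BOMs, the "MZ" executable marker, and the OLE2 container header that wraps old binary .doc files; the helper names are illustrative and the list is deliberately incomplete):

public class SignatureCheck {
    // True if the data begins with the given byte values.
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    // Inspects a sample (e.g. the first kilobyte mentioned above)
    // for well-known headers, most common formats first.
    static String sniff(byte[] sample) {
        if (startsWith(sample, 0xEF, 0xBB, 0xBF)) return "UTF-8 text (with BOM)";
        if (startsWith(sample, 0xFE, 0xFF)) return "UTF-16BE text";
        if (startsWith(sample, 0xFF, 0xFE)) return "UTF-16LE text";
        if (startsWith(sample, 0x4D, 0x5A)) return "Windows executable"; // "MZ"
        if (startsWith(sample, 0xD0, 0xCF, 0x11, 0xE0)) return "OLE2 container (.doc, .xls, ...)";
        return "unknown";
    }
}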