In my program (a web crawler), the user can load a file of links, but I need to verify that the file the user chooses is plain text and not something else (only plain text will be allowed). Is it possible to do this? If it's useful, I'm using JFileChooser to open the file. EDIT: What is expected from the user: a text file containing URLs. What I want to avoid: the user loading, for example, an MP3 file or an MS Word document.

How to check if a file is plain text?

3 Answers

A file is just a series of bytes, and without further information, you cannot tell whether these bytes are supposed to be code points in some string encoding (say, ASCII or UTF-8 or ANSI-something) or something else. You will have to resort to heuristics, such as:

Try to parse the file in a number of known encodings and see if the parsing succeeds. If it does, chances are you have a text file.
If you expect text files in Western languages only, you can assume that the majority of characters lie in the ASCII range (0..127), more specifically the printable range (33..126) plus whitespace (tab, newline, carriage return, space). Count occurrences of each distinct byte value, and if the overwhelming majority of the bytes fall in this 'typical Western characters' set, it's usually safe to assume it's a text file.
Extending the previous approach: sample a sufficiently large quantity of text in the languages you expect and build a character frequency profile. To check a file, compare its character frequency profile against that reference profile and see whether it's close enough.
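
The first two heuristics could be sketched roughly as follows. This is only an illustration: the class name `TextHeuristics`, the method names, and the 0.95 threshold are assumptions made for the example, not part of the answer. Note that strict decoding is only informative for encodings that can actually reject input (such as UTF-8); a single-byte charset like ISO-8859-1 accepts every byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and the threshold are illustrative assumptions.
final class TextHeuristics {

    // Heuristic 1: strict decoding. Returns true only if every byte
    // sequence in the data is valid in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic 2: fraction of bytes that are printable ASCII (33..126)
    // or common whitespace (tab, newline, carriage return, space).
    // An empty input is treated as text here, which is a design choice.
    static double printableAsciiRatio(byte[] data) {
        if (data.length == 0) {
            return 1.0;
        }
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF; // bytes are signed in Java
            boolean whitespace = v == '\t' || v == '\n' || v == '\r' || v == ' ';
            if (whitespace || (v >= 33 && v <= 126)) {
                printable++;
            }
        }
        return (double) printable / data.length;
    }

    // Combined guess: valid UTF-8 and at least 95% "typical" bytes (assumed threshold).
    static boolean looksLikeWesternText(byte[] data) {
        return decodesCleanly(data, StandardCharsets.UTF_8)
                && printableAsciiRatio(data) >= 0.95;
    }
}
```

Both checks are combined because a run of NUL bytes is valid UTF-8 on its own, so a clean decode alone does not rule out binary content.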

But here's another solution: Just treat everything you receive as text, applying the necessary transformations where needed (e.g. HTML-encode when sending to a web browser). As long as you prevent the file from being interpreted as binary data (such as a user double-clicking the file), the worst you'll produce is gibberish data.

177

answered Oct 18 '22 20:10

tdammers

Text is also a form of binary data.

I suppose what you want to check is whether there are any bytes in your input with a value < 32. If you can safely assume that your text uses an ASCII-compatible multi-byte encoding (such as UTF-8, where bytes below 128 only ever represent ASCII characters), then you could just scan through the entire file and abort if you hit a byte in the range [0, 32) (excluding 9, 10, 13, and whatever else you may expect in "text"; worst case, check only for null bytes [thanks, tdammers!]). If you could plausibly receive UTF-16 or UTF-32 encoded text, you'll have to work harder, because those encodings legitimately contain such bytes.
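
A minimal sketch of that scan, assuming an ASCII-compatible encoding and allowing only tab, line feed, and carriage return below 32. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative assumptions.
final class ControlByteCheck {

    // Allowed: everything from 32 upwards, plus tab (9), LF (10), CR (13).
    static boolean isAllowedByte(int v) {
        return v >= 32 || v == 9 || v == 10 || v == 13;
    }

    // Returns true if any of the first len bytes is a disallowed control byte.
    static boolean containsForbiddenByte(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (!isAllowedByte(buf[i] & 0xFF)) {
                return true;
            }
        }
        return false;
    }

    // Scans the whole file in chunks and stops at the first forbidden byte.
    static boolean looksLikeText(Path file) throws IOException {
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (containsForbiddenByte(buf, n)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With a JFileChooser, the selected file could be passed in as `chooser.getSelectedFile().toPath()`.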

answered Oct 18 '22 22:10

Kerrek SB

If you do not want to guess by file extension, you can read the first portion of the file. The next problem is then the character encoding. Use a BufferedInputStream (call mark() before reading and reset() afterwards), wrap it in an InputStreamReader with encoding "ISO-8859-1", and count the characters read for which Character.isLetterOrDigit() or Character.isWhitespace() returns true to get the ratio of typical text content. I think the ratio should be more than 80% for a text file.

You can also try other encodings such as UTF-8, but you may run into invalid characters when the file is not actually UTF-8.
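
The sampling idea might look roughly like this. It is a sketch under stated assumptions: the class name, the 4096-byte sample size, and the method names are invented for the example. It also deviates slightly from the description above: instead of wrapping the stream in an InputStreamReader, it reads the sample bytes directly and decodes them as ISO-8859-1 afterwards, because a reader may read ahead past the mark limit and make reset() fail.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and the sample size are illustrative assumptions.
final class TextRatioSniffer {

    static final int SAMPLE_BYTES = 4096;   // assumed sample size
    static final double MIN_RATIO = 0.80;   // the 80% suggested in the answer

    // Reads up to SAMPLE_BYTES, rewinds the stream, and returns the share of
    // characters that are letters, digits, or whitespace.
    static double sampleTextRatio(BufferedInputStream in) throws IOException {
        in.mark(SAMPLE_BYTES);
        byte[] sample = new byte[SAMPLE_BYTES];
        int len = 0;
        while (len < sample.length) {
            int n = in.read(sample, len, sample.length - len);
            if (n == -1) {
                break;
            }
            len += n;
        }
        in.reset(); // the caller can now read the stream from the start again

        if (len == 0) {
            return 1.0; // empty input is treated as text (a design choice)
        }
        // ISO-8859-1 maps every byte to exactly one character, so decoding cannot fail.
        String text = new String(sample, 0, len, StandardCharsets.ISO_8859_1);
        int textual = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                textual++;
            }
        }
        return (double) textual / len;
    }

    static boolean isProbablyText(BufferedInputStream in) throws IOException {
        return sampleTextRatio(in) >= MIN_RATIO;
    }
}
```

One caveat for this particular use case: a list of URLs contains many ':', '/' and '.' characters, which neither Character method counts, so the threshold may need to be lowered or punctuation counted as well.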

answered Oct 18 '22 20:10

Arne Burmeister

Tags:

java

text

binary-data

Renato Dinhani
