In my program, the user can load a file with links (it's a webcrawler), but I need to verify if the file that the user chooses is plain text or something else (only plain text will be allowed).
Is it possible to do this? If it's useful, I'm using JFileChooser to open the file.
EDIT:
What is expected from the user: a text file containing URLs.
What I want to avoid: the user loading an MP3 file or an MS Word document, for example.
File extensions: We can usually tell if a file is binary or text based on its file extension. This is because by convention the extension reflects the file format, and it is ultimately the file format that dictates whether the file data is binary or text.
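Since the question mentions JFileChooser, the extension convention can be applied with Swing's FileNameExtensionFilter. This is only a minimal sketch: the class name ExtensionFilterExample, the method names, and the choice of "txt" as the only accepted extension are assumptions made for illustration. An extension is a hint, not proof, so a renamed MP3 would still pass this check.

```java
import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.filechooser.FileNameExtensionFilter;

final class ExtensionFilterExample {

    /** Builds a filter that accepts directories and files ending in .txt (case-insensitive). */
    static FileNameExtensionFilter textFilter() {
        return new FileNameExtensionFilter("Text files (*.txt)", "txt");
    }

    /** Restricts the chooser to the text filter and hides the "All files" option. */
    static void restrictToText(JFileChooser chooser) {
        chooser.setAcceptAllFileFilterUsed(false);
        chooser.setFileFilter(textFilter());
    }

    /** Re-checks the file the user actually picked, since a name can also be typed by hand. */
    static boolean hasTextExtension(File file) {
        return file.isFile() && textFilter().accept(file);
    }
}
```

The filter only narrows what the dialog shows; the content-based checks discussed in the answers are still needed if the extension cannot be trusted.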
To view a plaintext file, a text editor such as Microsoft Notepad is used. However, other editors, including Microsoft WordPad and Word, can also open plaintext files, because such files carry no special formatting.
For example, plaintext emails are messages that contain only text. Promotional email campaigns often use plaintext messages to avoid strict spam-filtering systems that tend to block messages that are HyperText Markup Language-encoded or that add other binary components.
Plain text refers to any string (i.e., finite sequence of characters) that consists entirely of printable characters (i.e., human-readable characters) and, optionally, a very few specific types of control characters (e.g., characters indicating a tab or the start of a new line).
A file is just a series of bytes, and without further information, you cannot tell whether these bytes are supposed to be code points in some string encoding (say, ASCII or UTF-8 or ANSI-something) or something else. You will have to resort to heuristics, such as:
But here's another solution: Just treat everything you receive as text, applying the necessary transformations where needed (e.g. HTML-encode when sending to a web browser). As long as you prevent the file from being interpreted as binary data (such as a user double-clicking the file), the worst you'll produce is gibberish data.
Text is also a form of binary data.
I suppose what you want to check is whether there are any characters in your input that are < 32. If you can safely assume that your text is multi-byte encoded, then you could just scan through the entire file and abort if you hit a byte in the range [0, 32) (excluding 9, 10, 13, and whatever else you may accept in "text" -- or worst-case only check for null bytes [thanks, tdammers!]). If you could plausibly expect to receive UTF-16 or UTF-32 encoded text, you'll have to work harder.
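The byte scan described above might look roughly like this in Java. It is a sketch under stated assumptions: the class and method names are invented for illustration, only tab (9), line feed (10) and carriage return (13) are allowed among the control bytes, and InputStream.readNBytes(int) requires Java 11 or later. As the answer warns, UTF-16 or UTF-32 text contains zero bytes and would be rejected by this check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ControlByteCheck {

    /** True for bytes in [0, 32) other than tab, line feed and carriage return. */
    static boolean isSuspicious(int unsignedByte) {
        return unsignedByte < 32
                && unsignedByte != 9
                && unsignedByte != 10
                && unsignedByte != 13;
    }

    /** Returns false as soon as one suspicious control byte is found. */
    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            // Mask to 0..255 so bytes >= 0x80 are not seen as negative numbers.
            if (isSuspicious(b & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    /** Checks only the first maxBytes of the file (assumption: a prefix is representative). */
    static boolean looksLikeText(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeText(in.readNBytes(maxBytes));
        }
    }
}
```

With JFileChooser, the selected file could be passed in as chooser.getSelectedFile().toPath(). Scanning a prefix instead of the whole file is a speed trade-off; binary data that starts with a long run of printable bytes would slip through.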
If you do not want to guess by file extension, you may read the first portion of the file. But the next problem will be the character encoding. Using a BufferedInputStream (mark() before and reset() afterwards), wrap it with an InputStreamReader with encoding "ISO-8859-1" and count the characters read for which Character.isLetterOrDigit() or Character.isWhitespace() returns true, to get a ratio of typical text content. I think the ratio should be more than 80% for a text file.
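A possible sketch of this ratio heuristic follows. The class name, method names and parameters are assumptions made for illustration. One deliberate deviation from the description: instead of wrapping the stream in an InputStreamReader, the sketch reads the sample bytes first and decodes them as ISO-8859-1 afterwards, so that no reader read-ahead can exceed the mark limit. It needs Java 11 or later for readNBytes(int).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class TextRatioCheck {

    /** Fraction of characters that are letters, digits or whitespace (1.0 for empty input). */
    static double textRatio(byte[] sample) {
        if (sample.length == 0) {
            return 1.0; // assumption: an empty file is treated as text
        }
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails.
        String decoded = new String(sample, StandardCharsets.ISO_8859_1);
        int textLike = 0;
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                textLike++;
            }
        }
        return (double) textLike / decoded.length();
    }

    /** Samples up to sampleSize bytes, then rewinds the stream with reset(). */
    static double sampleRatio(BufferedInputStream in, int sampleSize) throws IOException {
        in.mark(sampleSize);
        try {
            return textRatio(in.readNBytes(sampleSize));
        } finally {
            in.reset();
        }
    }

    /** Applies a caller-chosen threshold, for example 0.8 as suggested above. */
    static boolean looksLikeText(byte[] sample, double threshold) {
        return textRatio(sample) > threshold;
    }
}
```

One caveat for this particular use case: punctuation such as ':', '/' and '.' counts against the ratio, and a list of URLs is full of it, so the 80% figure may be too strict here. Lowering the threshold or also counting printable ASCII punctuation are options worth testing against real input files.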
You can also try other encodings such as UTF-8, but you may run into invalid characters when the file is not actually UTF-8.
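The invalid-character problem can itself be used as a signal: a strict UTF-8 decode that fails suggests the data is not UTF-8 text. The following is a sketch only, with the class and method names chosen for illustration. It relies on CharsetDecoder configured with CodingErrorAction.REPORT, which throws instead of silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown on the first byte sequence that is not legal UTF-8.
            return false;
        }
    }
}
```

Two limits are worth keeping in mind. Pure ASCII bytes, including control bytes such as 0, are valid UTF-8, so this check does not replace a control-byte scan. And if only a prefix of the file is checked, a multi-byte character cut off at the end of the sample would be reported as invalid.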