All,
I am trying to identify plain text files with Mac line endings and, inside an InputStream, silently convert them to Windows or Linux line endings (the important part is the LF character, really). Specifically, I'm working with several APIs that take InputStreams and are hard-locked to looking for \n as newlines.
Sometimes, I get binary files. A file that isn't text-like shouldn't have this substitution done, because a byte that merely happens to have the same value as \r can't silently be followed by a \n without corrupting the data.
I am attempting to use java.net.URLConnection.guessContentTypeFromStream and only perform endline conversions if the type is text/plain. Unfortunately, "text/plain" doesn't seem to be in its gamut of return values; all I get is null for my flat text files, and it's possibly not safe to assume all unidentifiable files can be modified.
What better library (preferably open-source and in a public Maven repository) can I use to do this? Alternatively, how can I make guessContentTypeFromStream work for me? I know I'm describing an inherently hazardous application and no solution can be perfect, but should I just treat null as likely to mean "text/plain" and write more code myself to look for evidence that it isn't?
It seems to me that what you're asking is to determine if a file is textual or not. Given that, there is a solution here that seems right:
Granted, he is talking about Unix, Bash and Perl, but the concept is the same:
Unless you inspect every byte of the file, you are not going to get this 100%. And there is a big performance hit with inspecting every byte. But after some experiments, I settled on an algorithm that works for me. I examine the first line and declare the file to be binary if I encounter even one non-text byte. It seems a little slack, I know, but I seem to get away with it.
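In Java, the first-line heuristic quoted above might be sketched roughly as follows. This is only an illustration, not the quoted author's code: the class name TextSniffer, the method names, and the definition of a "text byte" (tab, LF, CR, form feed, or printable ASCII 0x20 to 0x7E) are all assumptions, and the stream overload relies on mark/reset so the caller can still read the stream from the start afterwards.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper, not part of any library.
final class TextSniffer {
    private TextSniffer() {}

    /** Assumption: "text" means tab, LF, CR, form feed, or printable ASCII (0x20..0x7E). */
    static boolean isTextByte(int b) {
        return b == '\t' || b == '\n' || b == '\r' || b == '\f'
                || (b >= 0x20 && b <= 0x7E);
    }

    /** Checks bytes up to and including the first CR or LF, or up to length if there is none. */
    static boolean firstLineLooksLikeText(byte[] data, int length) {
        for (int i = 0; i < length; i++) {
            int b = data[i] & 0xFF;          // treat the byte as unsigned
            if (!isTextByte(b)) {
                return false;                // one non-text byte is enough to say "binary"
            }
            if (b == '\n' || b == '\r') {
                break;                       // end of the first line, stop looking
            }
        }
        return true;                         // note: an empty input also counts as text
    }

    /** Peeks at no more than maxBytes, then rewinds the stream for the real consumer. */
    static boolean firstLineLooksLikeText(BufferedInputStream in, int maxBytes)
            throws IOException {
        in.mark(maxBytes);
        byte[] buf = new byte[maxBytes];
        int total = 0;
        int n;
        while (total < maxBytes && (n = in.read(buf, total, maxBytes - total)) > 0) {
            total += n;
        }
        in.reset();
        return firstLineLooksLikeText(buf, total);
    }
}
```

Anything past the first line terminator is never inspected, so a file that starts with a text-like line and then turns binary would still be accepted. That is the "a little slack" trade-off the quote mentions.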
EDIT #1:
Expanding on this type of solution, it seems like a reasonable approach would be to ensure the file contains no non-ASCII characters (unless you're dealing with files that are non-English... that's another solution). This could be done by checking that the file contents, read as a String, match this:
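// -- uses commons-io (org.apache.commons.io.FileUtils); ISO-8859-1 maps each byte to exactly one char
String fileAsString = FileUtils.readFileToString( new File( "file-name-here" ), StandardCharsets.ISO_8859_1 );
// matches() must cover the whole input, so this is true only if every char is ASCII
boolean isTextualFile = fileAsString.matches( "\\p{ASCII}*" );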
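For experimenting without a file on disk, a whole-input variant of the ASCII check could look something like the sketch below. The class and method names (AsciiCheck, isAllAscii) are made up for illustration; only java.util.regex from the JDK is used.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the check on in-memory strings.
final class AsciiCheck {
    private AsciiCheck() {}

    // \p{ASCII} is the range 0x00..0x7F; the trailing * plus matches()
    // means every character of the input has to fall in that range.
    private static final Pattern ALL_ASCII = Pattern.compile("\\p{ASCII}*");

    static boolean isAllAscii(String contents) {
        return ALL_ASCII.matcher(contents).matches();
    }
}
```

Because \p{ASCII} covers everything from 0x00 to 0x7F, a string containing NUL or other control characters still passes, so this check alone will not catch binary data that happens to stay below 0x80.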
EDIT #2:
You may want to try this as your regex instead, or something close to it, since \p{ASCII} also accepts control characters such as NUL. Though, I'll admit it could likely use some refining.
"[\\p{Print}\\p{Space}]*"