I have a bunch of plain text files which I downloaded from 3rd party servers. Some of them are gibberish: the server declared the encoding as ENCODING1 (e.g. UTF8), but in reality the file was encoded as ENCODING2 (e.g. Windows1252).
Is there a way to somehow correct these files?
I presume the declared encoding (ENCODING1) was mostly UTF8, ISO-8859-2 or Windows1252, and that the files were actually saved in one of these same encodings. I was thinking about re-encoding the content of every file with
new String(text.getBytes(ENCODING1), ENCODING2)
for all combinations of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options), then finding some way (for example, character frequency?) to tell which of the 9 results is the correct one.
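A minimal sketch of the idea, just to make the question concrete. The class name MojibakeRepair and the scoring rule are my own assumptions, not something from a library: combinations that do not round-trip cleanly are rejected, and among the rest the result with the fewest non-ASCII characters wins (UTF8 text misread as a single-byte encoding turns each accented letter into two or three characters). This cannot tell ISO-8859-2 and Windows1252 apart from each other; that would need a real character frequency model.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // The three candidate encodings named in the question.
    static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8,
        Charset.forName("ISO-8859-2"),
        Charset.forName("windows-1252")
    };

    // Undo one wrong decode: turn the text back into bytes with the encoding
    // that was wrongly assumed, then decode those bytes with the real one.
    // Returns null if either step is lossy, so impossible pairs are discarded.
    static String recode(String text, Charset wrong, Charset actual) {
        try {
            ByteBuffer bytes = wrong.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            return actual.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Assumed heuristic: fewer non-ASCII characters means less likely garbled.
    static int nonAsciiCount(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                count++;
            }
        }
        return count;
    }

    // Try all 9 pairs and keep the input unless a candidate scores strictly better.
    static String repair(String text) {
        String best = text;
        int bestScore = nonAsciiCount(text);
        for (Charset wrong : CANDIDATES) {
            for (Charset actual : CANDIDATES) {
                String candidate = recode(text, wrong, actual);
                if (candidate != null && nonAsciiCount(candidate) < bestScore) {
                    best = candidate;
                    bestScore = nonAsciiCount(candidate);
                }
            }
        }
        return best;
    }
}
```

Calling MojibakeRepair.repair(text) on the content of a garbled file returns the best-scoring candidate, or the text unchanged if nothing scores better. Text where the wrong decode already lost bytes (for example, replacement characters from an invalid UTF8 sequence) cannot be recovered this way.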
Are there any 3rd party libraries for this?
I tried JChardet and ICU4J, but as far as I know both of them can only detect the encoding of the original bytes, i.e. before the wrong decoding with ENCODING1 took place.
Thanks, krisy
You can use juniversalchardet, a Java port of Mozilla's universal charset detector that was hosted on Google Code, to detect the character set of a file. Please see the following:
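import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector
{
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1) {
            System.err.println("Usage: java TestDetector FILENAME");
            System.exit(1);
        }
        byte[] buf = new byte[4096];
        String fileName = args[0];

        // (1) Create the detector; null means no listener is registered.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed the file's bytes until the detector is done or the file ends.
        // try-with-resources closes the stream, which the original never did.
        try (FileInputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that no more data will follow.
        detector.dataEnd();

        // (4) Read the result; null means no encoding could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector so it can be reused for another file.
        detector.reset();
    }
}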
Read more at following URL
You can also try jCharDet, which is hosted on SourceForge; please see the following URL
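Once a detector has reported a charset name, correcting a file comes down to decoding its raw bytes with that charset and writing them back out in the encoding you want. A rough sketch, assuming the file on disk still contains the bytes exactly as the server sent them; the class name Transcode and the method toUtf8 are made up for this example:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {
    // Decode the raw bytes with the detected charset and rewrite them as UTF-8.
    // detectedCharset is a charset name such as the one a detector reports.
    static void toUtf8(Path source, Path target, String detectedCharset) throws IOException {
        byte[] raw = Files.readAllBytes(source);
        String text = new String(raw, Charset.forName(detectedCharset));
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

If the download step already decoded the bytes with the wrong charset and saved the result, the raw bytes are gone and detection on the saved file will not help on its own.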
Cheers !!