Is Apache Tika able to extract text in foreign languages such as Chinese and Japanese?
I have the following code:
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
InputStream stream = new ByteArrayInputStream(bytes);
OutputStream outputstream = new ByteArrayOutputStream();
ContentHandler textHandler = new BodyContentHandler(outputstream);
Metadata metadata = new Metadata();
// Set<String> langs = LanguageIdentifier.getSupportedLanguages();
// metadata.set(Metadata.CONTENT_LANGUAGE, lang);
// metadata.set(Metadata.FORMAT, hint);
ParseContext context = new ParseContext();
try {
    parser.parse(stream, textHandler, metadata, context);
    String extractedText = outputstream.toString();
    return extractedText;
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
}
If the input is a doc file that contains Chinese characters, each Chinese character is extracted as "?".
Thanks a lot!
Apache Tika is able to extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.
Tika also includes a number of unit tests for this, which verify that it works. One such test uses this sample Chinese email. If we use the command-line Tika app and grab the first few lines, we can see it working:
$ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
Alfresco MSG format testing ( MSG 格式測試 )
From
Tests Chang@FT (張毓倫)
To
Tests Chang@FT (張毓倫)
Recipients
[email protected]
Or with this Japanese document:
$ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
ゾルゲの処刑記録、
ゾルゲと尾崎、淡々と最期
You'll just need to ensure that any text output you generate gets stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
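To illustrate the encoding point, here is a minimal JDK-only sketch (no Tika involved) of how a "?" per character can appear. It assumes the text is turned into bytes with a charset that cannot represent Chinese, which is what may happen when output goes through a byte stream and the platform default charset is not Unicode-capable. The class name EncodingRoundTrip and the method roundTrip are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Hypothetical helper: encode text to bytes and decode it again with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the charset's
    // replacement bytes, which is "?" for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "MSG " followed by the four Chinese characters from the sample email above,
        // written as Unicode escapes so the source file encoding does not matter.
        String text = "MSG \u683C\u5F0F\u6E2C\u8A66";

        String viaUtf8 = roundTrip(text, StandardCharsets.UTF_8);
        String viaAscii = roundTrip(text, StandardCharsets.US_ASCII);

        System.out.println(viaUtf8.equals(text)); // true: UTF-8 preserves the characters
        System.out.println(viaAscii);             // MSG ????
    }
}
```

If the same thing is happening in your code, one thing to try is avoiding the byte stream altogether, for example by letting the content handler collect the text as a String, or by making sure the same Unicode-capable charset is used both when writing to and when reading from the stream.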