Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Question

I have the following code:

    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    InputStream stream = new ByteArrayInputStream(bytes);
    OutputStream outputstream = new ByteArrayOutputStream();
    ContentHandler textHandler = new BodyContentHandler(outputstream);
    Metadata metadata = new Metadata();
    // Set<String> langs = LanguageIdentifier.getSupportedLanguages();
    // metadata.set(Metadata.CONTENT_LANGUAGE, lang);
    // metadata.set(Metadata.FORMAT, hint);
    ParseContext context = new ParseContext();
    try {
        parser.parse(stream, textHandler, metadata, context);
        String extractedText = outputstream.toString();
        return extractedText;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }

If the input is a .doc file that contains Chinese characters, each Chinese character is extracted as "?".

Thanks a lot!

Gagravarr · Accepted Answer

Apache Tika is able to extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.

Tika also includes a number of unit tests that verify this works. One such test uses a sample Chinese email, testMSG_chinese.msg. If we use the command-line Tika app and grab the first few lines, we see it working:

    $ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
    Alfresco MSG format testing ( MSG 格式測試 )
        From
        Tests Chang@FT (張毓倫)
        To
        Tests Chang@FT (張毓倫)
        Recipients
        tests.chang@fengttt.com

Or with this Japanese document:

    $ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
    ゾルゲの処刑記録、
    ゾルゲと尾崎、淡々と最期

You'll just need to ensure that any text output you generate is stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
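To illustrate why the encoding matters, here is a minimal standard-library sketch (no Tika involved) of what happens when text is turned into bytes and back. The class name `EncodingRoundTrip` and the method `roundTrip` are made up for this example; the sample string is borrowed from the Chinese email above and written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingRoundTrip {

    // Encodes the text to bytes with one charset and decodes it with another,
    // standing in for "write to a byte stream, then read it back as a String".
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // for every character the charset cannot represent.
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        // "MSG " followed by four Chinese characters from the sample email.
        String sample = "MSG \u683C\u5F0F\u6E2C\u8A66";

        // UTF-8 can represent every character, so the text survives intact.
        String viaUtf8 = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(sample));

        // US-ASCII cannot represent the Chinese characters, so each becomes '?'.
        String viaAscii = roundTrip(sample, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        System.out.println(viaAscii.equals("MSG ????"));
    }
}
```

If the same thing is happening in the question's code, the characters are probably being lost when the extracted text passes through the byte stream, not during parsing. One way to avoid the byte round trip, assuming the extracted text fits in memory, is to create the `BodyContentHandler` with its no-argument constructor, which collects text in an internal string buffer, and read the result with the handler's `toString()` instead of going through an `OutputStream`.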
