
Is Apache Tika able to extract foreign languages like Chinese, Japanese?

I have the following code:

    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    InputStream stream = new ByteArrayInputStream(bytes);
    OutputStream outputstream = new ByteArrayOutputStream();
    ContentHandler textHandler = new BodyContentHandler(outputstream);
    Metadata metadata = new Metadata();
    // Set<String> langs = LanguageIdentifier.getSupportedLanguages();
    // metadata.set(Metadata.CONTENT_LANGUAGE, lang);
    // metadata.set(Metadata.FORMAT, hint);
    ParseContext context = new ParseContext();
    try {
        parser.parse(stream, textHandler, metadata, context);
        String extractedText = outputstream.toString();
        return extractedText;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }

If the input is a doc file that contains Chinese characters, each Chinese character is extracted as "?".

Thanks a lot!

user2182833 asked Nov 04 '22 01:11



1 Answer

Apache Tika can extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.

Tika also includes a number of unit tests that verify this works. One such test uses this sample Chinese email (testMSG_chinese.msg). If we use the command line Tika app and grab the first few lines, we see it working:

    $ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
    Alfresco MSG format testing ( MSG 格式測試 )
        From
        Tests Chang@FT (張毓倫)
        To
        Tests Chang@FT (張毓倫)
        Recipients
        [email protected]

Or with this Japanese document:

    $ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
    ゾルゲの処刑記録、
    ゾルゲと尾崎、淡々と最期

You just need to ensure that any text output you generate is stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
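
To illustrate the encoding point, here is a minimal JDK-only sketch (no Tika involved, and the class and method names are made up for this example). It writes a string containing the Chinese characters from the sample email through an in-memory stream, once as UTF-8 and once as US-ASCII. A "?" per character, as described in the question, is what you would typically see when the charset in use cannot represent the text. The Chinese characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
final class EncodingDemo {

    // Encodes the text into an in-memory stream with the given charset,
    // then decodes the bytes with the same charset and returns the result.
    static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // "MSG " followed by four Chinese characters, as Unicode escapes.
        String sample = "MSG \u683c\u5f0f\u6e2c\u8a66";

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample)); // prints true

        // US-ASCII cannot, so each Chinese character is replaced with '?'.
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII)); // prints MSG ????
    }
}
```

In the question's code, the bytes in the `ByteArrayOutputStream` are turned into a `String` with `toString()`, which relies on the platform default charset. Assuming that is where the characters are lost, one option would be to pass `BodyContentHandler` a `Writer` (for example a `StringWriter`), or to use its no-argument constructor and call `toString()` on the handler, so that no byte-level encoding step is involved.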

Gagravarr answered Nov 15 '22 08:11
