
How to read or parse MHTML (.mht) files in java

I need to extract the content of most common document formats, such as:

  1. pdf
  2. html
  3. doc/docx etc.

For most of these file formats I am planning to use:

http://tika.apache.org/

But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ). There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.

I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the CodeProject page I mentioned earlier, MHT files use GZip compression.

Attempting to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)
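
Both exceptions can be cross-checked by looking at the first bytes of the file: a GZIP stream starts with the bytes 0x1f 0x8b and a ZIP archive starts with the letters "PK". If the file starts with readable text instead, neither decompressor applies. The following is a minimal sketch of such a check using only the JDK; the class name FormatSniffer and its methods are hypothetical and not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper that inspects the leading bytes of a file. */
final class FormatSniffer {

    /** True if the data starts with 0x1f 0x8b, the GZIP magic number. */
    static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /** True if the data starts with "PK", the signature of ZIP records. */
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 2 && head[0] == 'P' && head[1] == 'K';
    }

    /** Reads at most n bytes from the start of the file. */
    static byte[] readHead(Path file, int n) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // end of file reached before n bytes
                }
                total += r;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to inspect
        byte[] head = readHead(Paths.get(args[0]), 64);
        System.out.println("GZIP magic: " + looksLikeGzip(head));
        System.out.println("ZIP magic:  " + looksLikeZip(head));
        System.out.println("As text:    " + new String(head, StandardCharsets.US_ASCII));
    }
}
```

Printing the first bytes as text makes it easy to see whether the file begins with plain-text headers rather than compressed data.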

Kindly suggest how to decompress it.

Thanks.

asked Jul 12 '10 16:07 by Favonius



2 Answers

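A more compact solution using the JavaMail API. An MHT file is a MIME multipart message, not a compressed archive, so it can be parsed as a MimeMessage: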

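import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.Properties;

import javax.mail.BodyPart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;

import org.apache.commons.io.IOUtils;

public class MhtParser {

    private final File mhtFile;
    private final File outputFolder;

    public MhtParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /** Writes each named part of the MHT file into the output folder. */
    public void decompress() throws Exception {
        // try-with-resources closes the input file even if parsing fails
        try (InputStream in = new FileInputStream(mhtFile)) {
            MimeMessage message = new MimeMessage(
                    Session.getDefaultInstance(new Properties(), null), in);

            Object content = message.getContent();
            if (!(content instanceof MimeMultipart)) {
                return; // not a multipart message, so there is nothing to split
            }
            MimeMultipart mimeMultipart = (MimeMultipart) content;
            outputFolder.mkdirs(); // also creates missing parent folders

            for (int i = 0; i < mimeMultipart.getCount(); i++) {
                BodyPart bodyPart = mimeMultipart.getBodyPart(i);
                String fileName = bodyPart.getFileName();

                if (fileName == null) {
                    // Fall back to the last path segment of Content-Location.
                    // new URL(...) throws MalformedURLException for relative
                    // locations and for schemes it does not know.
                    String[] locationHeader = bodyPart.getHeader("Content-Location");
                    if (locationHeader != null && locationHeader.length > 0) {
                        fileName =
                            new File(new URL(locationHeader[0]).getFile()).getName();
                    }
                }

                // Skip parts without a usable name (for example a location ending in "/")
                if (fileName != null && !fileName.isEmpty()) {
                    // getInputStream() returns the part with its transfer encoding decoded
                    try (InputStream partIn = bodyPart.getInputStream();
                         OutputStream out =
                             new FileOutputStream(new File(outputFolder, fileName))) {
                        IOUtils.copy(partIn, out);
                    }
                }
            }
        }
    }
}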
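
The Content-Location fallback above goes through java.net.URL, which rejects relative locations and unknown schemes such as cid:, and whose getFile() keeps any query string in the name. A more tolerant variant could look roughly like the sketch below. It uses only the JDK; the names PartNames and fileNameFor, and the idea of passing a fallback name, are assumptions for illustration and not part of JavaMail.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for naming extracted MHT parts. */
final class PartNames {

    /**
     * Returns the last path segment of a Content-Location value, or the
     * given fallback when no usable file name can be derived from it.
     */
    static String fileNameFor(String contentLocation, String fallback) {
        if (contentLocation == null) {
            return fallback;
        }
        String path;
        try {
            // Unlike URL, URI accepts relative references and any scheme,
            // and getPath() excludes the query string.
            path = new URI(contentLocation.trim()).getPath();
        } catch (URISyntaxException e) {
            return fallback; // e.g. the value contains unescaped spaces
        }
        if (path == null) {
            return fallback; // opaque URI such as "cid:..." has no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        // Reject empty names and directory references
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return fallback;
        }
        return name;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/images/logo.png?v=2",
            "images/logo.png",
            "cid:part1@example.com",
            "http://example.com/"
        };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + fileNameFor(samples[i], "part-" + i));
        }
    }
}
```

In decompress(), the fallback could be built from the loop index so that parts without a usable location are still written out instead of being skipped.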
answered Sep 17 '22 22:09 by rakesh


You don't have to do it on your own.

Add this dependency:

<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j</artifactId>
    <version>0.7.2</version>
</dependency>

Run MessageTree on your MHT file:

public static void main(String[] args)
{
    MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
}

According to its Javadoc, MessageTree:

/**
 * Displays a parsed Message in a window. The window will be divided into
 * two panels. The left panel displays the Message tree. Clicking on a
 * node in the tree shows information on that node in the right panel.
 *
 * Some of this code have been copied from the Java tutorial's JTree section.
 */

Then you can look into how it works.

;-)

answered Sep 19 '22 22:09 by wener