I need to mine the content of most known document file formats.
For most of these formats I am planning to use Apache Tika:
http://tika.apache.org/
But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ).
There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.
I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.
According to the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the CodeProject page I mentioned earlier, MHT files use GZip compression.
Attempting to decompress the file in Java results in the following exceptions.
With java.util.zip.GZIPInputStream:
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)
And with java.util.zip.ZipFile:
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)
Kindly suggest how to decompress it.
Thanks.
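A more compact version using the JavaMail API. An MHT file is a MIME multipart message rather than a GZip or ZIP archive, so it can be parsed as a MimeMessage: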
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.Properties;
import javax.mail.BodyPart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;
import org.apache.commons.io.IOUtils;

public class MhtParser {
    private final File mhtFile;
    private final File outputFolder;

    public MhtParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    public void decompress() throws Exception {
        // Keep the input stream open until every part has been written.
        try (InputStream in = new FileInputStream(mhtFile)) {
            MimeMessage message =
                new MimeMessage(Session.getDefaultInstance(new Properties(), null), in);
            Object content = message.getContent();
            if (!(content instanceof MimeMultipart)) {
                return; // not a multipart message, nothing to extract
            }
            outputFolder.mkdirs();
            MimeMultipart mimeMultipart = (MimeMultipart) content;
            for (int i = 0; i < mimeMultipart.getCount(); i++) {
                BodyPart bodyPart = mimeMultipart.getBodyPart(i);
                String fileName = bodyPart.getFileName();
                if (fileName == null) {
                    // Fall back to the last path segment of Content-Location.
                    // new URL(...) throws MalformedURLException if the header
                    // is not an absolute URL with a known protocol.
                    String[] locationHeader = bodyPart.getHeader("Content-Location");
                    if (locationHeader != null && locationHeader.length > 0) {
                        fileName = new File(new URL(locationHeader[0]).getFile()).getName();
                    }
                }
                // Skip parts without a usable name (for example a URL ending in "/").
                if (fileName != null && !fileName.isEmpty()) {
                    try (OutputStream out =
                            new FileOutputStream(new File(outputFolder, fileName))) {
                        IOUtils.copy(bodyPart.getInputStream(), out);
                    }
                }
            }
        }
    }
}
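To see why the GZIPInputStream and ZipFile attempts in the question fail, it can help to look at the first bytes of the file. The following is a minimal sketch using only the JDK; the class name MhtSniffer and its method names are made up for illustration, and the MIME check is a rough heuristic rather than a real parser. GZip streams start with the bytes 0x1f 0x8b and ZIP archives with "PK", while an MHT file starts with plain-text MIME headers.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper (not part of JavaMail or Tika): guesses the container
// format of a file from its leading bytes.
public class MhtSniffer {

    /** GZip streams begin with the magic bytes 0x1f 0x8b. */
    public static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
            && (head[0] & 0xff) == 0x1f
            && (head[1] & 0xff) == 0x8b;
    }

    /** ZIP archives begin with the ASCII letters "PK". */
    public static boolean looksLikeZip(byte[] head) {
        return head.length >= 2 && head[0] == 'P' && head[1] == 'K';
    }

    /** Heuristic only: looks for typical MIME headers near the top of the file. */
    public static boolean looksLikeMime(byte[] head) {
        String text = new String(head, StandardCharsets.US_ASCII).toLowerCase(Locale.ROOT);
        return text.contains("mime-version:") || text.contains("content-type: multipart/");
    }

    public static void main(String[] args) throws Exception {
        byte[] head = new byte[512];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // A single read may return fewer than 512 bytes; that is fine for sniffing.
            int n = in.read(head);
            head = Arrays.copyOf(head, Math.max(n, 0));
        }
        System.out.println("gzip: " + looksLikeGzip(head));
        System.out.println("zip:  " + looksLikeZip(head));
        System.out.println("mime: " + looksLikeMime(head));
    }
}
```

If the MIME check is the only one that matches, a MIME parser such as the MimeMessage approach above is the appropriate tool, not a decompressor.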
You don't have to do it on your own.
With this dependency:
<dependency>
<groupId>org.apache.james</groupId>
<artifactId>apache-mime4j</artifactId>
<version>0.7.2</version>
</dependency>
Run it on your MHT file:
public static void main(String[] args)
{
    MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
}
According to its Javadoc, MessageTree does the following:
/**
* Displays a parsed Message in a window. The window will be divided into
* two panels. The left panel displays the Message tree. Clicking on a
* node in the tree shows information on that node in the right panel.
*
* Some of this code have been copied from the Java tutorial's JTree section.
*/
Then you can look into it.
;-)
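If a GUI is not available, the part structure can also be inspected as plain text. The sketch below is not mime4j; it is a simplified JDK-only illustration with a made-up class name, MhtPartLister. It assumes the boundary parameter appears in the top-level headers and that header lines are not folded, so treat it as a way to peek at a file rather than as a MIME parser.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the Content-Location header of each MIME part.
public class MhtPartLister {
    private static final Pattern BOUNDARY =
        Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);
    private static final String LOCATION = "Content-Location:";

    /** Returns one entry per part: its Content-Location, or "(none)" if absent. */
    public static List<String> listLocations(String mht) {
        List<String> locations = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(mht);
        if (!m.find()) {
            return locations; // no boundary parameter, so not multipart
        }
        String delimiter = "--" + m.group(1).trim();
        boolean inPart = false;     // currently inside a part
        boolean inHeaders = false;  // currently inside that part's header block
        String current = null;      // Content-Location of the current part, if seen
        for (String line : mht.split("\\r?\\n")) {
            boolean opening = line.equals(delimiter);
            boolean closing = line.equals(delimiter + "--");
            if (opening || closing) {
                if (inPart) {
                    locations.add(current == null ? "(none)" : current);
                }
                inPart = opening;
                inHeaders = opening;
                current = null;
                if (closing) {
                    break; // final delimiter, no more parts
                }
            } else if (inHeaders) {
                if (line.isEmpty()) {
                    inHeaders = false; // a blank line ends the part headers
                } else if (line.regionMatches(true, 0, LOCATION, 0, LOCATION.length())) {
                    current = line.substring(LOCATION.length()).trim();
                }
            }
        }
        if (inPart) {
            // The file ended without a closing delimiter; keep the last part anyway.
            locations.add(current == null ? "(none)" : current);
        }
        return locations;
    }

    public static void main(String[] args) throws Exception {
        // ISO-8859-1 maps every byte to a character, so no input is rejected.
        String mht = new String(
            Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1);
        for (String location : listLocations(mht)) {
            System.out.println(location);
        }
    }
}
```

For anything beyond a quick look (folded headers, nested multiparts, decoding base64 or quoted-printable bodies), a real MIME library such as mime4j or JavaMail is the safer choice.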