Extracting images and text from an mht file

Tags: mhtml

I have an .mht file that contains images and some text. When I open it in Notepad++, I see XML followed by illegible text, which I assume is the encoded image data. Can somebody tell me how I can extract the images and text from an .mht file using a Java program? Thanks.

SomeDude asked Dec 09 '13 17:12



2 Answers

This approach is a bit dated, but opening the file in Internet Explorer and saving it as HTML also does the job.

Update:

If you open the .mht file in IE and save it with "Save as type" set to "Webpage, complete (.htm;.html)", IE creates a 'filename.htm' file and a 'filename_files' directory. That directory will contain a lot of .tmp files. For output from the MS "Problem Steps Recorder", these include a number of files with '(1)' in the name (for example, there might be both a 'mhtD3B8.tmp' file and a 'mhtD3B8(1).tmp' file). The '(1)' files are the images: they are in .jpg format and simply carry a .tmp extension. Search that folder for all files with '(1)' in the name and copy them to a different directory.

Then open a cmd window in the new directory. To change all the extensions at once, type "rename *.tmp *.jpg" (without the quotes) and press Enter. Voila: all the image files are extracted.
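Since the question asks about Java, the copy-and-rename step could also be scripted instead of done by hand. The following is only a sketch under the assumptions stated in this answer: the images are the .tmp files with '(1)' in their name, and they are really JPEGs. The class name TmpImageCollector and the method collect are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class TmpImageCollector {
    /**
     * Copies every "*(1)*.tmp" file from filesDir into targetDir,
     * giving each copy a .jpg extension. Returns the copied paths.
     */
    static List<Path> collect(Path filesDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> copied = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(filesDir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                // Assumption from the answer: '(1)' in the name marks an image.
                if (Files.isRegularFile(p) && name.contains("(1)") && name.endsWith(".tmp")) {
                    String jpgName = name.substring(0, name.length() - ".tmp".length()) + ".jpg";
                    copied.add(Files.copy(p, targetDir.resolve(jpgName),
                            StandardCopyOption.REPLACE_EXISTING));
                }
            }
        }
        return copied;
    }
}
```

A call such as TmpImageCollector.collect(Paths.get("filename_files"), Paths.get("images")) would leave the original .tmp files untouched and put the renamed copies in 'images'.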

As for accessing the text - since the file is now saved as a .htm file, you should be able to open that file in Notepad++ and parse/read it properly there.

Hope this helps!

Calimero100582 answered Oct 21 '22 17:10


There's an open-source Perl tool called unmht which should do the job. Its description says:

The first HTML file in the archive is taken to be the primary web page, the other contained files for "page requisites" such as images or frames. The primary web page is written to the output directory (the current directory by default), the requisites to a subdirectory named after the primary HTML file name without extension, with "_files" appended. Link URLs in all HTML files referring to requisites are rewritten to point to the saved files.
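For the Java route the question asks about: an .mht file is a MIME multipart/related message, so a similar split can be approximated with the standard library alone. The sketch below is not how unmht is implemented. It is a minimal illustration that assumes a single, non-nested multipart body, a boundary that does not occur inside any part, and parts encoded as base64, quoted-printable, or plain text. The names MhtSketch, Part, parse, and extract are made up for illustration, and it does not rewrite links the way unmht does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class MhtSketch {
    /** One MIME part: header names lower-cased, body already decoded to bytes. */
    static final class Part {
        final Map<String, String> headers;
        final byte[] body;
        Part(Map<String, String> headers, byte[] body) { this.headers = headers; this.body = body; }
        String contentType() {
            String v = headers.getOrDefault("content-type", "application/octet-stream");
            return v.split(";")[0].trim().toLowerCase(Locale.ROOT);
        }
    }

    /** Splits MHT text (read as ISO-8859-1 so chars map 1:1 to bytes) into decoded parts. */
    static List<Part> parse(String mht) {
        String text = mht.replace("\r\n", "\n");
        int headerEnd = text.indexOf("\n\n");
        if (headerEnd < 0) throw new IllegalArgumentException("no blank line after top-level headers");
        String topType = parseHeaders(text.substring(0, headerEnd)).getOrDefault("content-type", "");
        Matcher m = Pattern.compile("boundary=\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE).matcher(topType);
        if (!m.find()) throw new IllegalArgumentException("no MIME boundary in Content-Type");
        String[] chunks = text.substring(headerEnd + 1).split(Pattern.quote("\n--" + m.group(1)));
        List<Part> parts = new ArrayList<>();
        for (int i = 1; i < chunks.length; i++) {      // chunks[0] is the preamble
            if (chunks[i].startsWith("--")) break;      // closing delimiter "--boundary--"
            int sep = chunks[i].indexOf("\n\n");        // blank line between headers and body
            if (sep < 0) continue;
            Map<String, String> h = parseHeaders(chunks[i].substring(0, sep));
            String enc = h.getOrDefault("content-transfer-encoding", "7bit");
            parts.add(new Part(h, decode(chunks[i].substring(sep + 2), enc)));
        }
        return parts;
    }

    /** Parses "Name: value" lines, joining folded continuation lines. */
    static Map<String, String> parseHeaders(String block) {
        Map<String, String> headers = new LinkedHashMap<>();
        String last = null;
        for (String line : block.split("\n")) {
            if (line.isEmpty()) continue;
            if ((line.charAt(0) == ' ' || line.charAt(0) == '\t') && last != null) {
                headers.merge(last, line.trim(), (a, b) -> a + " " + b);
            } else if (line.indexOf(':') > 0) {
                last = line.substring(0, line.indexOf(':')).trim().toLowerCase(Locale.ROOT);
                headers.put(last, line.substring(line.indexOf(':') + 1).trim());
            }
        }
        return headers;
    }

    static byte[] decode(String body, String encoding) {
        switch (encoding.trim().toLowerCase(Locale.ROOT)) {
            case "base64": return Base64.getMimeDecoder().decode(body);
            case "quoted-printable": return decodeQuotedPrintable(body);
            default: return body.getBytes(StandardCharsets.ISO_8859_1);
        }
    }

    static byte[] decodeQuotedPrintable(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '=' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                i++;                                    // soft line break: drop "=\n"
            } else if (c == '=' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0 && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /** Writes each part to outDir as part0.html, part1.jpeg, ... based on its Content-Type. */
    static void extract(Path mhtFile, Path outDir) throws IOException {
        String raw = new String(Files.readAllBytes(mhtFile), StandardCharsets.ISO_8859_1);
        Files.createDirectories(outDir);
        int n = 0;
        for (Part p : parse(raw)) {
            String type = p.contentType();
            String ext = type.equals("text/html") ? "html"
                    : type.substring(type.indexOf('/') + 1).replaceAll("[^a-z0-9]", "");
            Files.write(outDir.resolve("part" + (n++) + "." + ext), p.body);
        }
    }

    public static void main(String[] args) throws IOException {
        extract(Paths.get(args[0]), Paths.get(args.length > 1 ? args[1] : "mht_out"));
    }
}
```

Run as "java MhtSketch page.mht outdir" (the file names here are placeholders). For anything beyond a quick extraction, a full MIME parser is likely a safer choice than this hand-rolled split.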

zb226 answered Oct 21 '22 15:10