
Is there a decent, customisable, HTML to Markdown Java API?

I want to save text I scrape from various sources without the HTML tags that are on it, but also keeping as much of the structure as I reasonably can.

Markdown seems to be the solution to this (or possibly MultiMarkdown).

There is a question which offers a suggestion on converting from HTML to Markdown, but I have some specific requirements:

  • ALL links (including images) are referenced at the END only (i.e. no inline URLs)
  • NO embedded HTML (I'm not even 100% sure yet how I'd like to deal with difficult HTML... but it won't be embedded!)

So my question is as stated in the title: Is there a decent, customisable, HTML to Markdown Java API?

barryred asked Nov 01 '10 17:11



1 Answer

You could try adapting HtmlCleaner which provides a workable interface onto the DOM:

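// htmlCleaner is an org.htmlcleaner.HtmlCleaner; clean() parses the HTML into a TagNode tree.
TagNode root = htmlCleaner.clean( stream );
// XPath selects attributes with '@'.
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// evaluateXPath returns Object[], so check the element's type before casting.
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}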

This would allow you to structure your output stream in any format that you want using a fairly simple API.
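
As a rough illustration of the tree-walking idea, here is a minimal sketch that emits Markdown with every link and image referenced at the end and no embedded HTML. It makes several assumptions: it uses the JDK's own org.w3c.dom parser rather than HtmlCleaner's TagNode, so the input must already be well-formed XHTML; the class name RefLinkMarkdown and its methods are hypothetical; and only a, img and p are handled, with all other tags dropped and their text kept.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of HtmlCleaner or any library.
class RefLinkMarkdown {
    private final StringBuilder body = new StringBuilder();
    private final List<String> urls = new ArrayList<>();

    // Converts a well-formed XHTML fragment to Markdown with reference-style links.
    static String convert(String xhtml) {
        Node root;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap in a single root so fragments with several top-level tags parse.
            byte[] bytes = ("<root>" + xhtml + "</root>").getBytes(StandardCharsets.UTF_8);
            root = builder.parse(new ByteArrayInputStream(bytes)).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        RefLinkMarkdown md = new RefLinkMarkdown();
        md.walk(root);
        StringBuilder out = new StringBuilder(md.body.toString().trim());
        if (!md.urls.isEmpty()) {
            out.append("\n");
        }
        // All URLs are emitted here, after the text, never inline.
        for (int i = 0; i < md.urls.size(); i++) {
            out.append("\n[").append(i + 1).append("]: ").append(md.urls.get(i));
        }
        return out.toString();
    }

    private void walk(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            body.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        Element el = (Element) node;
        switch (el.getTagName().toLowerCase()) {
            case "a":
                body.append('[');
                walkChildren(el);
                body.append("][").append(ref(el.getAttribute("href"))).append(']');
                break;
            case "img":
                body.append("![").append(el.getAttribute("alt")).append("][")
                    .append(ref(el.getAttribute("src"))).append(']');
                break;
            case "p":
                walkChildren(el);
                body.append("\n\n");
                break;
            default:
                // Unknown tags are dropped; only their text content survives.
                walkChildren(el);
        }
    }

    private void walkChildren(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i));
        }
    }

    // Returns the 1-based reference number for a URL, reusing it for repeats.
    private int ref(String url) {
        int idx = urls.indexOf(url);
        if (idx < 0) {
            urls.add(url);
            idx = urls.size() - 1;
        }
        return idx + 1;
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "<p>See <a href=\"http://example.com\">the docs</a>.</p>"));
    }
}
```

With HtmlCleaner, the same walk could presumably be written against the TagNode tree returned by clean(), which would also cope with markup that is not well-formed; the numbering and end-of-document reference list would stay the same.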

Gary Rowe answered Oct 14 '22 13:10