
Why does pandoc keep span and div tags when converting HTML to Markdown?

I'm a pandoc newbie, so I must be missing something obvious. I'm trying to convert an HTML file generated by MS Word to Markdown. Here is a test HTML file:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>
<body>
  <div class="Section1">
    <p class="Question"><span style="FONT-SIZE: 10pt">Today</span> <span style=
    "FONT-SIZE: 10pt">is</span> <span lang="HR" style=
    "FONT-SIZE: 10pt; mso-ansi-language: HR">a</span><span style=
    "FONT-SIZE: 10pt">nice</span> <span style="FONT-SIZE: 10pt">day</span> 
    </p>
  </div>
</body>
</html>

I try to convert it with:

pandoc -f html -t markdown test.html -o test.md

I was expecting "Today is a nice day", but got:

<div class="Section1">

<span style="FONT-SIZE: 10pt">Today</span> <span
style="FONT-SIZE: 10pt">is</span> <span lang="HR"
style="FONT-SIZE: 10pt; mso-ansi-language: HR">a</span><span
style="FONT-SIZE: 10pt">nice</span> <span
style="FONT-SIZE: 10pt">day</span>

</div>

Why was the div kept? Why were the spans kept?

asked Mar 04 '16 22:03 by igorludi




1 Answer

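You need to turn off some extensions. With native_divs and native_spans enabled, the HTML reader preserves div and span elements, and the Markdown writer then emits them as raw HTML. Disable them either on the HTML input side: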

$ pandoc -f html-native_divs-native_spans -t markdown test.html -o test.md

Or on the markdown output side:

$ pandoc -f html -t markdown-raw_html-native_divs-native_spans-fenced_divs-bracketed_spans test.html -o test.md
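The long format string on the output side is easy to mistype. As a minimal sketch, the small shell function below assembles it from a base format and a list of extensions to disable. The name pandoc_format_without is hypothetical and not part of pandoc; the function only builds the string and does not call pandoc itself.

```shell
# Hypothetical helper (not part of pandoc): build a format spec such as
# "html-native_divs-native_spans" from a base format plus the names of
# the extensions to disable.
pandoc_format_without() {
  spec=$1          # base format, e.g. "html" or "markdown"
  shift
  for ext in "$@"; do
    # pandoc disables an extension when its name is prefixed with "-"
    spec="${spec}-${ext}"
  done
  printf '%s\n' "$spec"
}

# Print the spec used on the output side above
pandoc_format_without markdown raw_html native_divs native_spans fenced_divs bracketed_spans
```

It could then be used as: pandoc -f html -t "$(pandoc_format_without markdown raw_html native_divs native_spans fenced_divs bracketed_spans)" test.html -o test.md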
answered Sep 18 '22 15:09 by mb21