
Converting HTML to JSON with pandoc

I'm trying to take HTML and generate JSON that keeps the same structure.

I'm trying to use pandoc, as I've had some success converting things from format A to format B with it before.

I'm trying to convert this file:

example.html

<p>Hello guys! What's up?</p>

Using the command:

pandoc -f html -t json example.html

What I expect is something like:

[{ "p": "Hello guys! What's up?"}]

What I get is:

[
  { "Para":
    [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Str", "c": "guys!"},
      {"t": "Space"},
      {"t": "Str", "c": "What's"},
      {"t": "Space"},
      {"t": "Str", "c": "up?"}
    ]
  }
]

The problem seems to be that when pandoc reads the text content, it splits it on every space character and produces an array of words, whereas I expected pandoc to treat the whole string as a single element.

I'm a beginner at pandoc and I've not been able to find out how to tweak that behavior.

Do you have an idea of how I can get the desired output? Do you know another tool that can do this? The tool, and the language it's written in, don't matter.

Thanks.

Edit: You can test this behavior online with the pandoc online tool.

Edit 2: Workaround. I couldn't find out how to do the HTML->JSON conversion with pandoc. As a workaround, I used the suggestion from the comments and implemented a solution using Himalaya, which is a Node package. The result is exactly what I wished for, even though it doesn't use pandoc.

Loïc N. asked Sep 21 '18 08:09




2 Answers

Currently, the pandoc JSON representation is not very human-readable; it is auto-generated from the Haskell pandoc data types (the document AST). There is some discussion about changing that eventually.

I guess you're looking for something like https://codebeautify.org/xmltojson? There also seem to be plenty of command-line tools that do that.

mb21 answered Oct 05 '22 13:10


Pandoc is a tool for converting documents. Its JSON output is just another representation of the AST (Abstract Syntax Tree) that pandoc works with internally:

Original Document --> Pandoc's AST --> Output Document
                   |                |
                pandoc           pandoc

Asking pandoc to output JSON is asking for the AST in its JSON format.

If I understand correctly, you need something more like an XML-to-JSON converter, such as the Python xmljson module or an online tool like this one.

There are plenty of tools for the job as you describe it; just search for "XML to JSON converter".
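For a sense of what such a converter does, here is a minimal sketch using only Python's standard library `html.parser`. It is not the xmljson module and not what any particular online tool does. The names `TagTreeBuilder`, `simplify` and `html_to_json` are made up for illustration. Attributes are ignored, and it assumes reasonably well-nested markup.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Collects elements as {tag: [children]}; text nodes are plain strings."""

    def __init__(self):
        super().__init__()
        self.root = []            # top-level nodes
        self.stack = [self.root]  # child lists of the currently open elements

    def handle_starttag(self, tag, attrs):
        children = []
        self.stack[-1].append({tag: children})
        if tag not in VOID_TAGS:
            self.stack.append(children)

    def handle_endtag(self, tag):
        # Naive: trusts that the end tag matches the innermost open element.
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():          # skip whitespace-only text between tags
            self.stack[-1].append(data)


def simplify(nodes):
    """Collapse elements that contain only text into {tag: "text"}."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
            continue
        (tag, children), = node.items()
        simplified = simplify(children)
        if all(isinstance(child, str) for child in simplified):
            out.append({tag: "".join(simplified)})
        else:
            out.append({tag: simplified})
    return out


def html_to_json(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return simplify(builder.root)


print(json.dumps(html_to_json("<p>Hello guys! What's up?</p>")))
# prints: [{"p": "Hello guys! What's up?"}]
```

For real-world HTML with attributes, mismatched tags or mixed content, a dedicated parser is a safer choice than this sketch.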

The JSON representation of the AST is normally used to pipe pandoc's output into another program that can handle JSON, so that you can alter the AST and write filters that manipulate the structure of your document.
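As an illustration of that kind of post-processing, the sketch below joins the `Str` and `Space` tokens of each paragraph back into one string. It assumes the JSON layout emitted by recent pandoc versions (a top-level `"blocks"` list whose nodes carry `"t"` and `"c"` keys); the output shown in the question is shaped differently, so the code may need adapting. The function names and the hand-written `sample` document are hypothetical.

```python
import json


def inlines_to_text(inlines):
    """Join pandoc inline nodes into plain text (only Str, Space, SoftBreak)."""
    parts = []
    for node in inlines:
        kind = node.get("t")
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        # Other inline types (Emph, Link, Code, ...) are skipped in this sketch.
    return "".join(parts)


def flatten_paragraphs(doc):
    """Return [{"p": text}, ...] for every top-level Para block."""
    return [
        {"p": inlines_to_text(block["c"])}
        for block in doc.get("blocks", [])
        if block.get("t") == "Para"
    ]


# Hand-written sample in the assumed layout, not captured pandoc output.
sample = {
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"}, {"t": "Space"},
            {"t": "Str", "c": "guys!"}, {"t": "Space"},
            {"t": "Str", "c": "What's"}, {"t": "Space"},
            {"t": "Str", "c": "up?"},
        ]}
    ]
}

print(json.dumps(flatten_paragraphs(sample)))
# prints: [{"p": "Hello guys! What's up?"}]
```

To run it on real output, one option is to replace `sample` with `json.load(sys.stdin)` and pipe `pandoc -f html -t json example.html` into the script.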

ekiim answered Oct 05 '22 11:10