I'm trying to take HTML and generate JSON that keeps the same structure.
I'm trying to use pandoc, as I've had some success converting from format A to format B with it before.
I'm trying to convert this file:
example.html
<p>Hello guys! What's up?</p>
Using the command:
pandoc -f html -t json example.html
What I expect is something like:
[{ "p": "Hello guys! What's up?"}]
What I get is:
[
  { "Para":
    [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Str", "c": "guys!"},
      {"t": "Space"},
      {"t": "Str", "c": "What's"},
      {"t": "Space"},
      {"t": "Str", "c": "up?"}
    ]
  }
]
The problem seems to be that when pandoc reads the text content, it splits it on every space character and produces an array of words, while I expected pandoc to treat the whole string as a single element.
I'm a beginner at pandoc and I've not been able to find out how to tweak that behavior.
Do you have an idea of how I can get the desired output? Do you know another tool that can do this? The tool, or the language it's written in doesn't matter.
Thanks.
Edit: You can test this behavior with the pandoc online tool.
Edit 2: Workaround. I couldn't find a way to do the HTML -> JSON conversion with pandoc. As a workaround, I followed the suggestion in the comments and implemented a solution using Himalaya, which is a Node package. The result is exactly what I wanted, even though it doesn't use pandoc.
Currently, the pandoc JSON representation is not very human-readable; it is auto-generated from the Haskell pandoc data types (the document AST). There is some discussion about changing that eventually.
I guess you're looking for something like https://codebeautify.org/xmltojson? There also seem to be plenty of command-line tools that do that.
Pandoc is a tool for converting documents. Its JSON output is just another representation of the AST (abstract syntax tree) that pandoc builds for every document:
Original Document --> Pandoc's AST --> Output Document
                   |                |
                 pandoc           pandoc
Asking pandoc to output JSON is asking for that AST in its JSON format.
If I understand correctly, you need something more like an XML-to-JSON converter, such as the Python xmljson module or an online tool.
There are plenty of tools for the job as you picture it; just search for "XML to JSON converter".
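For a sense of what such a converter does, here is a minimal sketch using only Python's standard-library `html.parser`. The names `StructureBuilder`, `simplify`, and `html_to_json` are made up for this example. It ignores attributes and assumes every tag is explicitly closed (so a bare `<br>` would confuse it); it is not a replacement for a real converter.

```python
import json
from html.parser import HTMLParser


class StructureBuilder(HTMLParser):
    """Hypothetical helper: collects elements as nested {tag: children} dicts."""

    def __init__(self):
        super().__init__()
        self.root = []            # top-level nodes
        self.stack = [self.root]  # child lists of the currently open elements

    def handle_starttag(self, tag, attrs):
        children = []
        self.stack[-1].append({tag: children})  # attributes are ignored
        self.stack.append(children)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():          # skip whitespace-only text between tags
            self.stack[-1].append(data)


def simplify(nodes):
    """Collapse an element whose only child is text into {tag: text}."""
    out = []
    for node in nodes:
        if isinstance(node, dict):
            (tag, children), = node.items()
            children = simplify(children)
            if len(children) == 1 and isinstance(children[0], str):
                children = children[0]
            out.append({tag: children})
        else:
            out.append(node)
    return out


def html_to_json(html):
    parser = StructureBuilder()
    parser.feed(html)
    parser.close()
    return simplify(parser.root)


print(json.dumps(html_to_json("<p>Hello guys! What's up?</p>")))
```

For the example file from the question this gives the one-element list with a single `p` key that the asker was hoping for; nested elements come out as nested lists.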
The JSON representation of the AST is normally output from pandoc and piped into another program that can handle JSON, so you can alter the AST and write filters that manipulate the structure of your document.
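As a rough illustration of that kind of post-processing, the sketch below joins the `Str` and `Space` nodes from the question's output back into one string. The function name `inlines_to_text` is made up, and the node shapes are taken from the output shown in the question; the surrounding layout of pandoc's JSON can differ between versions, so treat this as an assumption rather than a description of pandoc's format.

```python
import json


def inlines_to_text(inlines):
    """Join pandoc-style inline nodes into plain text (hypothetical helper).

    Only "Str", "Space" and "SoftBreak" nodes are handled; any other
    inline type (emphasis, links, ...) is skipped in this sketch.
    """
    parts = []
    for node in inlines:
        kind = node.get("t")
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


# The inline list from the question's "Para" element.
para = [
    {"t": "Str", "c": "Hello"},
    {"t": "Space"},
    {"t": "Str", "c": "guys!"},
    {"t": "Space"},
    {"t": "Str", "c": "What's"},
    {"t": "Space"},
    {"t": "Str", "c": "up?"},
]

print(json.dumps([{"p": inlines_to_text(para)}]))
```

A real filter would read the whole document from standard input and walk every block, but the joining step is the part that undoes the word-by-word splitting described in the question.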