I'd like to use JavaScript to parse an HTML document into an abstract syntax tree in which each node includes its start and end line numbers (and ideally character positions as well). Are there any existing solutions that can do this? I'd rather not write one myself.
Edit Apr 24, 2016: Ideally, it would also handle HTML with PHP tags in arbitrary places.
HTML parsing has two stages: tokenization and tree construction. The tokenizer emits tokens such as start tags, end tags, and attribute names and values. The tree builder then consumes those tokens and builds up the document tree. A well-formed document is simpler and faster to parse.
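To make the two stages concrete, here is a minimal sketch of a tokenizer and tree builder that attach start and end positions to every node. It is not a spec-compliant HTML parser: it assumes well-formed input, ignores attributes, comments, void elements, and PHP tags, and silently skips mismatched end tags. The names `makeLocator`, `tokenize`, and `buildTree` are made up for this illustration, and the `position` shape (1-based line and column, 0-based offset) is just one reasonable convention.

```javascript
// Map a character offset to a 1-based line/column pair.
function makeLocator(source) {
  const lineStarts = [0]; // offset at which each line begins
  for (let i = 0; i < source.length; i++) {
    if (source[i] === '\n') lineStarts.push(i + 1);
  }
  return function locate(offset) {
    let line = 0;
    while (line + 1 < lineStarts.length && lineStarts[line + 1] <= offset) line++;
    return { line: line + 1, column: offset - lineStarts[line] + 1, offset };
  };
}

// Stage 1: split the source into start-tag, end-tag, and text tokens.
// Attributes are matched but not parsed (an assumption of this sketch).
function tokenize(source) {
  const tokens = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g;
  let last = 0;
  let match;
  while ((match = tagPattern.exec(source)) !== null) {
    if (match.index > last) {
      tokens.push({ type: 'text', value: source.slice(last, match.index), start: last, end: match.index });
    }
    tokens.push({
      type: match[1] ? 'endTag' : 'startTag',
      name: match[2].toLowerCase(),
      start: match.index,
      end: tagPattern.lastIndex,
    });
    last = tagPattern.lastIndex;
  }
  if (last < source.length) {
    tokens.push({ type: 'text', value: source.slice(last), start: last, end: source.length });
  }
  return tokens;
}

// Stage 2: consume tokens and build a tree, using a stack of open elements.
function buildTree(source) {
  const locate = makeLocator(source);
  const root = { type: 'root', children: [], position: { start: locate(0), end: locate(source.length) } };
  const stack = [root];
  for (const token of tokenize(source)) {
    const parent = stack[stack.length - 1];
    if (token.type === 'text') {
      parent.children.push({
        type: 'text',
        value: token.value,
        position: { start: locate(token.start), end: locate(token.end) },
      });
    } else if (token.type === 'startTag') {
      // The end position stays null until the matching end tag is seen.
      const node = { type: 'element', tagName: token.name, children: [], position: { start: locate(token.start), end: null } };
      parent.children.push(node);
      stack.push(node);
    } else if (stack.length > 1 && parent.tagName === token.name) {
      // End positions point just past the closing ">".
      parent.position.end = locate(token.end);
      stack.pop();
    }
  }
  return root;
}

const tree = buildTree('<div>\n  <p>Hi</p>\n</div>');
console.log(JSON.stringify(tree, null, 2));
```

A real parser has to handle far more (error recovery, implied tags, entities, script and style content), which is why an existing library is the better choice for anything beyond experimentation.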
jsoup, a Java library, can parse HTML from files, input streams, URLs, or strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can also manipulate the content: the HTML elements themselves, their attributes, or their text.
An Abstract Syntax Tree, or AST, is a tree representation of the source code of a computer program that conveys the structure of the source code. Each node in the tree represents a construct occurring in the source code.
unified (https://unifiedjs.github.io/) can produce a CST or AST for several formats, including HTML.