
How to parse an HTML document into an AST that includes line numbers for each node?

I'd like to use JavaScript to parse an HTML document into an abstract syntax tree in which each node includes its start and end line numbers (and ideally its character positions as well). Are there any existing solutions that can do this? I don't want to have to write it myself.

Edit Apr 24, 2016: Being able to parse HTML with PHP tags in arbitrary places would be even better.

EricP asked Oct 13 '14 21:10




1 Answer

unified (https://unifiedjs.github.io/) can produce a CST or AST for several formats, including HTML.
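To illustrate what position data on such a tree looks like: nodes in the unist format used by unified can carry a `position` field with `start` and `end` points, each holding `line`, `column`, and `offset`. The sketch below walks a tree of that shape and lists the line range of every node. The sample tree is hand-written and simplified (whitespace-only text nodes are left out) so the example runs without dependencies, and `collectLineRanges` is a hypothetical helper name, not part of any library. In practice the tree would come from a parser, for example `unified().use(rehypeParse).parse(html)` with the `unified` and `rehype-parse` packages installed.

```javascript
// Hand-written, simplified sample in the unist/hast node shape (an assumption
// for illustration), modelled on the source "<div>\n  <p>Hi</p>\n</div>".
const sampleTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'div',
      position: {
        start: { line: 1, column: 1, offset: 0 },
        end: { line: 3, column: 7, offset: 24 },
      },
      children: [
        {
          type: 'element',
          tagName: 'p',
          position: {
            start: { line: 2, column: 3, offset: 8 },
            end: { line: 2, column: 12, offset: 17 },
          },
          children: [
            {
              type: 'text',
              value: 'Hi',
              position: {
                start: { line: 2, column: 6, offset: 11 },
                end: { line: 2, column: 8, offset: 13 },
              },
            },
          ],
        },
      ],
    },
  ],
};

// Hypothetical helper: depth-first walk that records the line and offset
// range of every node that has a `position` (generated nodes may lack one).
function collectLineRanges(node, out = []) {
  if (node.position) {
    const { start, end } = node.position;
    out.push({
      label: node.type === 'element' ? node.tagName : node.type,
      startLine: start.line,
      endLine: end.line,
      startOffset: start.offset,
      endOffset: end.offset,
    });
  }
  for (const child of node.children || []) {
    collectLineRanges(child, out);
  }
  return out;
}

const ranges = collectLineRanges(sampleTree);
for (const r of ranges) {
  console.log(
    `${r.label}: lines ${r.startLine}-${r.endLine}, offsets ${r.startOffset}-${r.endOffset}`
  );
}
```

Because every point also has an `offset`, the exact source text of a node can be recovered with `html.slice(startOffset, endOffset)`.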

Michael buller answered Oct 01 '22 00:10