
How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser.

I've read that it tokenizes everything first, and then parses it.

What does tokenize mean?

Does the parser read each character one at a time, building up a multidimensional array to store the structure?

For example, does it read a < and begin to capture the element, and then, once it meets a closing > (outside of an attribute), push the element onto an array stack somewhere?

I'm interested for the sake of knowing (I'm curious).

If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?

alex asked Jun 30 '10 14:06




1 Answer

Tokenizing can consist of several steps. For example, if you have this HTML code:

    <html>
        <head>
            <title>My HTML Page</title>
        </head>
        <body>
            <p style="special">
                This paragraph has special style
            </p>
            <p>
                This paragraph is not special
            </p>
        </body>
    </html>

The tokenizer may convert that string into a flat list of significant tokens, discarding whitespace (thanks to SasQ for the correction):

    ["<", "html", ">",
         "<", "head", ">",
             "<", "title", ">", "My HTML Page", "</", "title", ">",
         "</", "head", ">",
         "<", "body", ">",
             "<", "p", "style", "=", "\"", "special", "\"", ">",
                "This paragraph has special style",
            "</", "p", ">",
            "<", "p", ">",
                "This paragraph is not special",
            "</", "p", ">",
        "</", "body", ">",
    "</", "html", ">"
    ]
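A minimal sketch of how such a tokenizer could be written, to make the idea concrete. The function name `tokenize` and the `NAME` pattern are my own, and this is nowhere near what a real HTML tokenizer handles (no comments, entities, single-quoted or unquoted attribute values). It only tracks whether it is inside a tag, because whitespace separates tokens inside a tag but is part of the text outside of one:

```python
import re

# Hypothetical pattern for tag and attribute names: anything up to
# whitespace, ">", "=", or a double quote.
NAME = re.compile(r'[^\s>="]+')

def tokenize(html):
    """Split html into a flat token list, dropping insignificant whitespace."""
    tokens = []
    pos = 0
    in_tag = False
    while pos < len(html):
        if not in_tag:
            # Outside a tag: everything up to the next "<" is text.
            end = html.find("<", pos)
            if end == -1:
                end = len(html)
            text = html[pos:end].strip()
            if text:
                tokens.append(text)
            pos = end
            if pos < len(html):
                opener = "</" if html.startswith("</", pos) else "<"
                tokens.append(opener)
                pos += len(opener)
                in_tag = True
        else:
            ch = html[pos]
            if ch.isspace():
                pos += 1                      # whitespace only separates tokens
            elif ch in ">=":
                tokens.append(ch)
                pos += 1
                in_tag = ch != ">"            # ">" ends the tag
            elif ch == '"':
                # Quoted attribute value: keep it as one token between quotes.
                end = html.find('"', pos + 1)
                if end == -1:
                    end = len(html)
                tokens += ['"', html[pos + 1:end], '"']
                pos = end + 1
            else:
                match = NAME.match(html, pos)
                tokens.append(match.group())
                pos = match.end()
    return tokens
```

So the tokenizer does read the input character by character (or in small runs), but it only emits a flat list; it does not build any nested structure yet.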

There may be multiple tokenizing passes that convert a list of tokens into a list of higher-level tokens, as the following hypothetical HTML parser might do (the result is still a flat list):

    [("<html>", {}),
         ("<head>", {}),
             ("<title>", {}), "My HTML Page", "</title>",
         "</head>",
         ("<body>", {}),
            ("<p>", {"style": "special"}),
                "This paragraph has special style",
            "</p>",
            ("<p>", {}),
                "This paragraph is not special",
            "</p>",
        "</body>",
    "</html>"
    ]
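As a rough illustration, a second pass like that could look like the sketch below. The name `group_tokens` is hypothetical, and the code assumes well-formed input in which every attribute appears as exactly five low-level tokens (name, `=`, quote, value, quote); a real parser would have to cope with much messier input:

```python
def group_tokens(tokens):
    """Combine low-level tokens into tag-level tokens (still a flat list)."""
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "<":
            # Opening tag: "<", name, then zero or more attributes, then ">".
            name = tokens[i + 1]
            attrs = {}
            i += 2
            while tokens[i] != ">":
                # Assumed layout: attr_name, "=", '"', value, '"'
                attrs[tokens[i]] = tokens[i + 3]
                i += 5
            result.append((f"<{name}>", attrs))
            i += 1                             # skip the ">"
        elif tok == "</":
            # Closing tag: "</", name, ">"
            result.append(f"</{tokens[i + 1]}>")
            i += 3
        else:
            result.append(tok)                 # plain text passes through
            i += 1
    return result
```

Opening tags become `(tag, attributes)` tuples, closing tags become plain strings, and text is left untouched.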

The parser then converts that list of tokens into a tree or graph that represents the source text in a form that is more convenient for the program to access and manipulate:

    ("<html>", {}, [
        ("<head>", {}, [
            ("<title>", {}, ["My HTML Page"]),
        ]),
        ("<body>", {}, [
            ("<p>", {"style": "special"}, ["This paragraph has special style"]),
            ("<p>", {}, ["This paragraph is not special"]),
        ]),
    ])
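This is the step where a stack usually comes in. The sketch below is one simple way to do it, not how any particular HTML parser works: `build_tree` and the `"<root>"` sentinel are names I made up, and it assumes strictly nested tags, so it raises an error where real HTML parsers would try to recover:

```python
def build_tree(tokens):
    """Turn a flat list of tag-level tokens into nested (tag, attrs, children) tuples."""
    root = ("<root>", {}, [])        # hypothetical sentinel holding top-level nodes
    stack = [root]                   # stack[-1] is the element currently open
    for tok in tokens:
        if isinstance(tok, tuple):
            # Opening tag: create a node, attach it to its parent, descend into it.
            tag, attrs = tok
            node = (tag, attrs, [])
            stack[-1][2].append(node)
            stack.append(node)
        elif tok.startswith("</"):
            # Closing tag: it must match the innermost open element.
            if len(stack) == 1 or stack[-1][0] != "<" + tok[2:]:
                raise ValueError(f"unexpected closing tag {tok}")
            stack.pop()
        else:
            stack[-1][2].append(tok)  # text becomes a child of the open element
    if len(stack) != 1:
        raise ValueError(f"unclosed tag {stack[-1][0]}")
    return root[2]
```

Called on `[("<p>", {"style": "special"}), "Hi", "</p>"]`, it returns a list containing the single node `("<p>", {"style": "special"}, ["Hi"])`.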

At this point, parsing is complete; it is then up to the user to interpret the tree, modify it, and so on.
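For instance, interpreting the tree might be as simple as walking it recursively. The helper below (`iter_text` is a made-up name) assumes the `(tag, attributes, children)` tuple layout used in this answer and collects all text in document order:

```python
def iter_text(node):
    """Yield every text string under a (tag, attrs, children) node, in order."""
    tag, attrs, children = node
    for child in children:
        if isinstance(child, str):
            yield child                   # a text leaf
        else:
            yield from iter_text(child)   # a nested element

# The tree from this answer, written out as a literal.
tree = ("<html>", {}, [
    ("<head>", {}, [
        ("<title>", {}, ["My HTML Page"]),
    ]),
    ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ]),
])

texts = list(iter_text(tree))
```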

Lie Ryan answered Sep 19 '22 04:09