For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read each character in turn, building up a multidimensional array to store the structure?
For example, does it read a < and begin to capture the element, and then, once it meets a closing > (outside of an attribute), push it onto an array stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?
A parser is a compiler or interpreter component that breaks data into smaller elements for easy translation into another language. A parser takes input in the form of a sequence of tokens, interactive commands, or program instructions and breaks them up into parts that can be used by other components in programming.
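To see what a stream of tokens looks like in practice, here is a minimal sketch using `HTMLParser` from Python's standard library `html.parser` module, which reports start tags, end tags, and text as it scans the input. The `EventLogger` class and `log_events` helper are names made up for this illustration, and dropping whitespace-only text is a simplification chosen here, not something the library does on its own.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each tokenizer event instead of building a tree (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags (a choice made for this sketch)
        if data.strip():
            self.events.append(("text", data.strip()))


def log_events(html):
    """Hypothetical helper: feed the markup and return the recorded events."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


if __name__ == "__main__":
    for event in log_events('<p style="special"> This paragraph has special style </p>'):
        print(event)
```

Each event is roughly one higher-level token; nothing here has been arranged into a tree yet.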
Tokenizing can consist of several steps. For example, suppose you have this HTML code:
    <html>
      <head>
        <title>My HTML Page</title>
      </head>
      <body>
        <p style="special">
          This paragraph has special style
        </p>
        <p>
          This paragraph is not special
        </p>
      </body>
    </html>
The tokenizer may convert that string to a flat list of significant tokens, discarding whitespace between them (thanks, SasQ, for the correction):
    ["<", "html", ">",
     "<", "head", ">",
     "<", "title", ">", "My HTML Page", "</", "title", ">",
     "</", "head", ">",
     "<", "body", ">",
     "<", "p", "style", "=", "\"", "special", "\"", ">",
     "This paragraph has special style",
     "</", "p", ">",
     "<", "p", ">",
     "This paragraph is not special",
     "</", "p", ">",
     "</", "body", ">",
     "</", "html", ">"
    ]
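A first pass like this could be sketched in Python as a character scanner that tracks whether it is currently inside a tag. This is only an illustration under simplifying assumptions (double-quoted attribute values, no comments, no entities, no `<script>` handling); the `tokenize` function is a made-up name, not taken from any real parser.

```python
import re

# Inside a tag, a "name" is any run of characters that is not
# whitespace, "=", ">", or a double quote (an assumption of this sketch).
NAME = re.compile(r'[^\s=>"]+')


def tokenize(source):
    """Split markup into a flat list of low-level tokens (hypothetical sketch)."""
    tokens = []
    i, n = 0, len(source)
    in_tag = False
    while i < n:
        ch = source[i]
        if not in_tag:
            if source.startswith("</", i):
                tokens.append("</")
                i += 2
                in_tag = True
            elif ch == "<":
                tokens.append("<")
                i += 1
                in_tag = True
            else:
                # Text runs until the next "<"; surrounding whitespace is dropped.
                end = source.find("<", i)
                if end == -1:
                    end = n
                text = source[i:end].strip()
                if text:
                    tokens.append(text)
                i = end
        elif ch == ">":
            tokens.append(">")
            i += 1
            in_tag = False
        elif ch.isspace():
            i += 1
        elif ch == "=":
            tokens.append("=")
            i += 1
        elif ch == '"':
            # Quoted attribute value: emit the quote, the value, and the quote.
            end = source.find('"', i + 1)
            if end == -1:
                raise ValueError("unterminated attribute value")
            tokens.extend(['"', source[i + 1:end], '"'])
            i = end + 1
        else:
            match = NAME.match(source, i)
            tokens.append(match.group())
            i = match.end()
    return tokens
```

Note that a `>` inside a quoted attribute value does not end the tag here, because the whole quoted value is consumed in one step. That is the "outside of an attribute" case the question asks about.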
There may be multiple tokenizing passes that convert a list of tokens into a list of higher-level tokens, as the following hypothetical HTML parser might do (the result is still a flat list):
    [("<html>", {}),
     ("<head>", {}),
     ("<title>", {}), "My HTML Page", "</title>",
     "</head>",
     ("<body>", {}),
     ("<p>", {"style": "special"}), "This paragraph has special style", "</p>",
     ("<p>", {}), "This paragraph is not special", "</p>",
     "</body>",
     "</html>"
    ]
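A second pass of that kind might look like the sketch below, which folds the low-level tokens between `<` and `>` into a single tag token. It assumes well-formed input in which every attribute value is double-quoted; `group_tags` is a hypothetical name chosen for this example.

```python
def group_tags(tokens):
    """Fold low-level tokens into tag-level tokens (hypothetical sketch).

    Opening tags become ("<name>", {attributes}) tuples, closing tags become
    "</name>" strings, and text tokens pass through unchanged.
    """
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "<":
            name = tokens[i + 1]
            attrs = {}
            i += 2
            while tokens[i] != ">":
                attr_name = tokens[i]
                # Expected shape: attr_name, "=", '"', value, '"'
                if tokens[i + 1:i + 3] == ["=", '"'] and tokens[i + 4:i + 5] == ['"']:
                    attrs[attr_name] = tokens[i + 3]
                    i += 5
                else:
                    # Bare attribute without a value (stored as an empty string here)
                    attrs[attr_name] = ""
                    i += 1
            result.append(("<" + name + ">", attrs))
            i += 1  # skip the ">"
        elif tok == "</":
            # Shape: "</", name, ">"
            result.append("</" + tokens[i + 1] + ">")
            i += 3
        else:
            result.append(tok)
            i += 1
    return result


if __name__ == "__main__":
    flat = ["<", "p", "style", "=", '"', "special", '"', ">",
            "This paragraph has special style", "</", "p", ">"]
    print(group_tags(flat))
```

The output is still flat: nothing yet records that the text sits inside the paragraph.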
Then the parser converts that list of tokens into a tree or graph that represents the source text in a form that is more convenient for the program to access and manipulate:
    ("<html>", {}, [
        ("<head>", {}, [
            ("<title>", {}, ["My HTML Page"]),
        ]),
        ("<body>", {}, [
            ("<p>", {"style": "special"}, ["This paragraph has special style"]),
            ("<p>", {}, ["This paragraph is not special"]),
        ]),
    ])
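This is where a stack comes in. One way to sketch the tree-building step is to push each opening tag onto a stack and pop it when the matching closing tag arrives, attaching every new node to whatever is currently on top. The `build_tree` function and the `<root>` sentinel are inventions for this example, and the sketch rejects mismatched tags outright, whereas real HTML parsers recover from them and also handle void elements such as `<br>`.

```python
def build_tree(tokens):
    """Turn tag-level tokens into nested (name, attrs, children) tuples.

    Hypothetical sketch: assumes every opening tag has a matching closing tag.
    Returns the list of top-level nodes.
    """
    top_level = []
    # Sentinel entry so that stack[-1] always exists (made up for this sketch).
    stack = [("<root>", {}, top_level)]
    for tok in tokens:
        if isinstance(tok, tuple):
            # Opening tag: create a node, attach it to the current parent,
            # then make it the current parent.
            name, attrs = tok
            node = (name, attrs, [])
            stack[-1][2].append(node)
            stack.append(node)
        elif tok.startswith("</"):
            # Closing tag: it must match the innermost open element.
            expected = "</" + stack[-1][0][1:]
            if len(stack) == 1 or tok != expected:
                raise ValueError("unexpected closing tag: " + tok)
            stack.pop()
        else:
            # Text: becomes a child of the current parent.
            stack[-1][2].append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return top_level


if __name__ == "__main__":
    tokens = [("<p>", {"style": "special"}), "This paragraph has special style", "</p>"]
    print(build_tree(tokens))
```

Because the children lists are shared between the stack entries and the finished tree, appending to `stack[-1][2]` is all it takes to grow the tree in place.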
At this point, parsing is complete; it is then up to the user to interpret the tree, modify it, and so on.
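As one small example of interpreting such a tree, the sketch below walks it recursively and gathers all the text. It assumes the `(name, attrs, children)` tuple layout used above, with plain strings as text nodes; `collect_text` is a hypothetical helper name.

```python
def collect_text(node):
    """Return every text string under a node, in document order (sketch)."""
    if isinstance(node, str):
        return [node]
    _name, _attrs, children = node
    texts = []
    for child in children:
        texts.extend(collect_text(child))
    return texts


if __name__ == "__main__":
    tree = ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ])
    print(collect_text(tree))
```

The same recursive shape works for other tasks, such as finding elements by name or serializing the tree back to markup.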