
What does HTML Parsing mean? [closed]

I have heard of HTML parser libraries like Simple HTML DOM and HTML Parser. I have also seen questions that mention HTML parsing. What does it mean to parse HTML?

LightningBoltϟ asked Dec 06 '13 10:12



2 Answers

Contrary to what Spudley said, to parse basically means to resolve something (such as a sentence) into its component parts and describe their syntactic roles.

According to Wikipedia, parsing (or syntactic analysis) is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).

In your case, HTML parsing basically means taking in HTML code and extracting relevant information from it, such as the title of the page, its paragraphs, headings, links, bold text and so on.

Parsers:

A computer program that parses content is called a parser. In general, there are two kinds of parsers:

Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.

Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.

A few example parsers:

Top-down parsers:

  • Recursive descent parser
  • LL parser (Left-to-right, Leftmost derivation)
  • Earley parser
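
To make the top-down idea concrete, here is a minimal sketch of a recursive descent parser in Python. It is not taken from any library: the toy arithmetic grammar, the `tokenize` helper and the `ExprParser` class are all assumptions made up for illustration. Each grammar rule becomes one method, and the parser starts from the top rule (`expr`) and works its way down to individual tokens.

```python
import re

# Toy grammar (an assumption for this sketch):
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number, operator and parenthesis tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class ExprParser:
    """Hypothetical recursive descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")
```

With these definitions, `ExprParser(tokenize("1 + 2 * 3")).parse()` returns the nested tuple `('+', 1, ('*', 2, 3))`, a small parse tree in which multiplication binds tighter than addition.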

Bottom-up parsers:

  • Precedence parser
    • Operator-precedence parser
    • Simple precedence parser
  • BC (bounded context) parsing
  • LR parser (Left-to-right, Rightmost derivation)
    • Simple LR (SLR) parser
    • LALR parser
    • Canonical LR (LR(1)) parser
    • GLR parser
  • CYK parser
  • Recursive ascent parser
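
For contrast, a bottom-up parser can be sketched in the spirit of an operator-precedence parser. This is a simplified two-stack variant (close to the shunting-yard algorithm), not a full LR parser; the `PRECEDENCE` table and the `parse_bottom_up` function are hypothetical names, and parentheses are left out to keep it short. Tokens are shifted onto stacks, and whenever the operator on top of the stack binds at least as tightly as the incoming one, the topmost pieces are reduced into a larger subtree.

```python
# Binding strength of each operator (an assumption for this sketch).
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse_bottom_up(tokens):
    """Build a parse tree from a list of number and operator tokens."""
    operands = []   # completed subtrees
    operators = []  # operators still waiting for their right operand

    def reduce():
        # Combine the top operator with the two topmost subtrees.
        if len(operands) < 2:
            raise SyntaxError("operator is missing an operand")
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for token in tokens:
        if token.isdigit():
            operands.append(int(token))  # shift a basic element
        elif token in PRECEDENCE:
            # Reduce while the stacked operator binds at least as tightly.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                reduce()
            operators.append(token)  # shift the operator
        else:
            raise SyntaxError(f"unexpected token {token!r}")

    while operators:
        reduce()
    if len(operands) != 1:
        raise SyntaxError("malformed expression")
    return operands[0]
```

Calling `parse_bottom_up("1 + 2 * 3".split())` returns `('+', 1, ('*', 2, 3))`. The tree is assembled from the leaves upward: `2 * 3` is reduced first, and only then is it combined with `1`.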

Example parser:

Here's an example HTML parser in Python 3, using the standard library's html.parser module:

from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here's the output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
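
The same handler approach can be used to extract the kind of information mentioned earlier, such as the page title and its links. The following is only a sketch: `PageSummaryParser` and `summarize` are hypothetical names, and it assumes Python 3's `html.parser` module.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Hypothetical helper that collects the page title and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside <title>...</title>
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, links) for the given HTML string."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.links
```

For example, `summarize('<title>Test</title><a href="/about">About</a>')` returns `('Test', ['/about'])`.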

References

  • Wikipedia
  • Python docs

Anshu Dwibhashi answered Sep 23 '22 09:09


Parsing in general applies to any computer language, and is the process of taking the code as text and producing a structure in memory that the computer can understand and work with.

For HTML specifically, parsing is the process of taking raw HTML code, reading it, and generating a DOM tree object structure from it.
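
As a rough illustration of that idea, the sketch below builds a small tree of nodes from an HTML string using Python 3's `html.parser`. It is a heavily simplified stand-in for what a browser does: the `Node` and `TreeBuilder` classes, the `build_tree` function and the short `VOID_TAGS` set are assumptions for this example, and real browsers follow the much more detailed tree construction rules of the HTML specification, including error recovery.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag (partial list, for the sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Hypothetical tree node: a tag, its attributes and its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Keeps a stack of open elements and attaches new nodes to the top one."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].children.append(data)


def build_tree(html):
    """Parse an HTML string and return the root node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

After `tree = build_tree('<html><body><h1>Parse me!</h1></body></html>')`, the expression `tree.children[0].tag` is `'html'`, and walking down through `children` reaches the `h1` node whose only child is the text `'Parse me!'`.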

Spudley answered Sep 23 '22 09:09