I have heard of HTML parser libraries like Simple HTML DOM and HTML Parser, and I have seen questions that mention HTML parsing. What does it mean to parse HTML?


What does HTML Parsing mean? [closed]

2 Answers

Contrary to what Spudley said, to parse is basically to resolve (a sentence) into its component parts and describe their syntactic roles.

According to Wikipedia, parsing (or syntactic analysis) is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term comes from the Latin pars (orationis), meaning part (of speech).

In your case, HTML parsing basically means taking in HTML code and extracting relevant information from it, such as the title of the page, its paragraphs, headings, links, bold text and so on.
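To make "extracting relevant information" concrete, here is a minimal sketch that pulls the title and the link targets out of a page with Python 3's standard html.parser module. The class name TitleAndLinkExtractor, the helper extract_title_and_links and the sample markup are made up for illustration; a real page would usually call for a more forgiving library.

```python
from html.parser import HTMLParser


class TitleAndLinkExtractor(HTMLParser):
    """Collects the page title and every link target (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self.in_title:
            self.title += data


def extract_title_and_links(html):
    extractor = TitleAndLinkExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.title, extractor.links


# Sample input, invented for this example.
sample = ('<html><head><title>Test page</title></head>'
          '<body><p>See <a href="https://example.com/a">A</a> and '
          '<a href="/b">B</a>.</p></body></html>')
title, links = extract_title_and_links(sample)
print(title)
print(links)
```

The parser itself only reports tags and text as it meets them; deciding which of them count as "relevant" is up to the handler methods.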

Parsers:

A computer program that parses content is called a parser. Parsers broadly fall into two kinds:

Top-down parsing - The parser starts from the start symbol of the formal grammar and expands the grammar rules downward, trying to find a leftmost derivation of the input stream. Tokens are consumed from left to right. Ambiguity is accommodated by inclusive choice, that is, by expanding all alternative right-hand sides of a grammar rule.
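As a rough illustration of the top-down idea, here is a small recursive descent parser for arithmetic expressions. The grammar, the tokenize helper and the Parser class are all invented for this sketch; each grammar rule becomes one method, and parsing starts at the top rule (expr) and works its way down to the tokens.

```python
import re

# A token is either a run of digits or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    # expr -> term (("+" | "-") term)*
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    # term -> factor (("*" | "/") factor)*
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    # factor -> NUMBER | "(" expr ")"
    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")


tree = Parser(tokenize("1 + 2 * (3 - 4)")).parse()
print(tree)  # ('+', 1, ('*', 2, ('-', 3, 4)))
```

The rules are written with repetition instead of left recursion, because a method that called itself before consuming any token would never return.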

Bottom-up parsing - The parser starts with the input and attempts to rewrite it to the start symbol. Intuitively, it locates the most basic elements first, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parsing is shift-reduce parsing.
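The shift-reduce idea can be sketched for a toy grammar, E -> E "+" T | T and T -> "n". This is a deliberately naive version that reduces whenever a rule's right-hand side sits on top of the stack; it is not a table-driven LR parser, and the function name and grammar are made up for the example.

```python
def shift_reduce_parse(tokens):
    """Parse tokens for the toy grammar  E -> E "+" T | T,  T -> "n".

    Returns the parse tree as nested tuples, or raises SyntaxError.
    """
    stack = []  # entries are (symbol, tree) pairs

    def symbols(count):
        # Grammar symbols of the top `count` stack entries (count >= 1).
        return [symbol for symbol, _ in stack[-count:]]

    def reduce_once():
        # Try each rule's right-hand side against the top of the stack.
        if symbols(1) == ["n"]:
            stack[-1:] = [("T", ("T", "n"))]
        elif symbols(3) == ["E", "+", "T"]:
            tree = ("E", stack[-3][1], "+", stack[-1][1])
            stack[-3:] = [("E", tree)]
        elif symbols(1) == ["T"]:
            stack[-1:] = [("E", ("E", stack[-1][1]))]
        else:
            return False
        return True

    for token in tokens:
        if token not in ("n", "+"):
            raise SyntaxError(f"unexpected token {token!r}")
        stack.append((token, token))  # shift
        while reduce_once():          # reduce while a rule matches the top
            pass

    # The input is accepted only if everything was rewritten to the start symbol.
    if [symbol for symbol, _ in stack] != ["E"]:
        raise SyntaxError("input does not match the grammar")
    return stack[0][1]


print(shift_reduce_parse(["n", "+", "n"]))
# ('E', ('E', ('T', 'n')), '+', ('T', 'n'))
```

Note the direction: the parser never predicts what comes next. It only recognises the smallest pieces ("n") and combines them into larger ones until a single E remains.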

A few example parsers:

Top-down parsers:

Recursive descent parser
LL parser (Left-to-right, Leftmost derivation)
Earley parser

Bottom-up parsers:

Precedence parser
- Operator-precedence parser
- Simple precedence parser
BC (bounded context) parsing
LR parser (Left-to-right, Rightmost derivation in reverse)
- Simple LR (SLR) parser
- LALR parser
- Canonical LR (LR(1)) parser
- GLR parser
CYK parser
Recursive ascent parser

Example parser:

Here's an example HTML parser in Python 3, using the html.parser module from the standard library:

from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here's the output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

References

Wikipedia
Python docs

191

answered Sep 23 '22 09:09

Anshu Dwibhashi

Parsing in general applies to any computer language, and is the process of taking the code as text and producing a structure in memory that the computer can understand and work with.

Specifically for HTML, HTML parsing is the process of taking raw HTML code, reading it, and generating a DOM tree object structure from it.
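As a very rough sketch of what "generating a tree structure" can look like, the following builds a small in-memory tree from the events that Python 3's html.parser reports. The Node and TreeBuilder classes and the dump helper are invented for this example, and the list of void elements is partial. Real browsers follow the much more detailed parsing algorithm of the HTML Standard, including its error recovery, which this does not attempt.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.current.children.append(data)


def dump(node, depth=0):
    """Return the tree as a list of indented lines."""
    indent = "  " * depth
    if isinstance(node, str):
        return [f"{indent}{node!r}"]
    lines = [f"{indent}<{node.tag}>"]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed('<html><head><title>Test</title></head>'
             '<body><h1>Parse me!</h1></body></html>')
builder.close()
print("\n".join(dump(builder.root)))
```

Once the tree exists, the rest of a program can walk it (for example, find every h1 under body) without looking at the original text again.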

answered Sep 23 '22 09:09

Spudley


Tags:

html

parsing

html-parsing

LightningBoltϟ
