Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from and based on previous experiences, this isn't as easy as it sound. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tag, etc.).

Considering that I have no constraints regarding the technology, language or tool that I can use, what are your suggestions to easily parse and extract data from HTML pages? I have tried HTML Agility Pack, BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup parsing engine doesn't work with the pages I am passing to it).

like image 684
Martin Avatar asked Feb 24 '09 14:02

Martin


People also ask

How do you parse HTML?

HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.

Which Python package can you use to parse HTML?

Beautiful Soup (bs4) is a Python library that is used to parse information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches. To start parsing an HTML file, import the Beautiful Soup library and create a Beautiful Soup object as shown in the following code example.

Can you use regular expressions to parse HTML?

One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern. We can construct a well-formed regular expression to match and extract the link values from the above text as follows: href="http://. +?"


2 Answers

You can use pretty much any language you like just don't try and parse HTML with regular expressions.

So let me rephrase that and say: you can use any language you like that has a HTML parser, which is pretty much everything invented in the last 15-20 years.

If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.

like image 160
cletus Avatar answered Sep 21 '22 03:09

cletus


I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great html scraping and browsing interface with the text matching power of Ruby http://scrubyt.org/

here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb

require 'rubygems'
require 'scrubyt'

# Simple exmaple for scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml
like image 43
Stewart Robinson Avatar answered Sep 20 '22 03:09

Stewart Robinson