What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from, and based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).

Considering that I have no constraints regarding the technology, language, or tool that I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).

684

asked Feb 24 '09 14:02

Martin

2 Answers

You can use pretty much any language you like; just don't try to parse HTML with regular expressions.

So let me rephrase that and say: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.
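To illustrate why regular expressions are a poor fit, here is a minimal Ruby sketch. The helper name `naive_div_contents` and the sample markup are made up for this example; the point is only that a pattern cannot keep track of nesting, so it stops at the first closing tag it sees.

```ruby
# Hypothetical helper: try to grab the contents of the first <div>
# using a non-greedy regular expression.
def naive_div_contents(html)
  html[/<div[^>]*>(.*?)<\/div>/m, 1]
end

# Made-up sample: one <div> nested inside another.
html = '<div class="outer"><div class="inner">hello</div> world</div>'

# The match ends at the FIRST </div>, which belongs to the inner element,
# so the result is truncated and contains an unbalanced opening tag.
p naive_div_contents(html)
# prints "<div class=\"inner\">hello"
```

Making the pattern greedy only moves the problem: it then swallows sibling elements instead. A real parser builds a tree and does not have this limitation.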

If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.
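As a rough sketch of that repair step, the Ruby snippet below pipes markup through the `tidy` command-line tool. It assumes `tidy` is installed and on the PATH; the helper name `repair_html` and the choice of options are illustrative, not a recommended configuration.

```ruby
require 'open3'

# Options passed to the tidy CLI (an assumption for this sketch):
# suppress chatter, emit XHTML, treat input/output as UTF-8, and
# still produce output when tidy reports errors.
TIDY_ARGS = ['tidy', '-quiet', '-asxhtml', '-utf8',
             '--force-output', 'yes', '--show-warnings', 'no'].freeze

# Hypothetical helper: returns the repaired markup as a String,
# or nil if tidy is not installed or produced no output.
def repair_html(html)
  # tidy exits with 0 (clean), 1 (warnings) or 2 (errors); with
  # --force-output it still writes a document in the error case,
  # so the exit status is not checked here.
  out, _err, _status = Open3.capture3(*TIDY_ARGS, stdin_data: html)
  out.empty? ? nil : out
rescue Errno::ENOENT
  nil # the tidy executable was not found
end

fixed = repair_html('<p>unclosed <b>bold')
puts(fixed || 'tidy is not available on this machine')
```

The repaired output is well-formed XHTML, so it can then be handed to a stricter parser that would have rejected the original page.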

160

answered Sep 21 '22 03:09

cletus

I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/

Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:

require 'rubygems'
require 'scrubyt'

# Simple example for scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml
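The XPath selectors in that example can be tried without scrubyt. The sketch below uses REXML rather than scrubyt or hpricot, and runs against a small, made-up, well-formed fragment standing in for the profile markup. REXML is a strict XML parser, so this only works on markup that is already well-formed. The helper name `extract_profile` and all sample values are hypothetical.

```ruby
require 'rexml/document'

# Made-up, well-formed stand-in for the profile markup.
SAMPLE = <<~XML
  <ul class="about vcard entry-author">
    <li><span class="fn">Jane Doe</span></li>
    <li><span class="adr">Springfield</span></li>
    <li><a class="url" href="http://example.com/">example.com</a></li>
  </ul>
XML

# Hypothetical helper: apply the same XPath expressions as the
# scrubyt example and collect the results in a Hash.
def extract_profile(xml)
  doc = REXML::Document.new(xml)
  {
    full_name: REXML::XPath.first(doc, "//li//span[@class='fn']").text,
    location:  REXML::XPath.first(doc, "//li//span[@class='adr']").text,
    # An XPath ending in /@href selects the attribute node itself.
    website:   REXML::XPath.first(doc, "//li//a[@class='url']/@href").value
  }
end

p extract_profile(SAMPLE)
```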

answered Sep 20 '22 03:09

Stewart Robinson

Tags: html, html-parsing, screen-scraping
