
Pyparsing issue: scraping a website without using HTML tags

Tags:

pyparsing

I'm trying to use pyparsing to parse information from a website. However, all the examples I find rely primarily on HTML tags to identify the different bits of text.

For example, the page below is difficult to split up using HTML tags (the program doesn't work with this site). How could I separate the authors, title, etc.?

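from urllib.request import urlopen

from pyparsing import SkipTo, makeHTMLTags

paraStart, paraEnd = makeHTMLTags("p")

# read HTML from a web page and decode it to text so pyparsing can scan it
url = "http://www.cs.cf.ac.uk/contactsandpeople/allpubs.php?emailname=C.L.Mumford"
with urlopen(url) as serverListPage:
    htmlText = serverListPage.read().decode("utf-8", errors="replace")

# an opening <p>, everything up to the closing </p>, then the closing tag
para = paraStart + SkipTo(paraEnd).setResultsName("body") + paraEnd

for tokens, start, end in para.scanString(htmlText):
    # <p> tags carry no href attribute, so print only the captured body
    print(tokens.body)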

I'm relatively new to pyparsing and have looked through books & the web for examples of this. Any help would be much appreciated. Thanks.

EDIT: When I run the program, I get the following: Skip to content Skip to navigation menu

However, if I change the search from "p" to "li" for a different site, it pulls the information out as a single block.

JeremyS asked Dec 20 '25 18:12


1 Answer

You have to know a lot more about the content of the web page you are scraping data from. If you just blindly throw tag-tagend parsers at the page, you will just get random chunks of the text.

Try printing out the whole page's HTML (which your script captures in the variable htmlText), then look for patterns in the text that point to the data you are interested in. The data itself might be part of the pattern; that's okay. The bits of text inside <>s are the HTML tags. Pyparsing includes the makeHTMLTags function because the text inside a tag can vary wildly: optional and unexpected attributes, attributes in unexpected order, unexpected upper/lower case, or unexpected whitespace. makeHTMLTags covers all of that, which is why most web scrapers written with pyparsing use it to help define the pattern that leads to the interesting data.
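To make the point about tag variation concrete, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the page in the question); it writes the same kind of link three different ways and lets one makeHTMLTags expression handle all of them.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical sample: three <a> tags that differ in case, quoting,
# extra attributes, and whitespace.
sample = """
<a href="one.html">first</a>
<A  class="x"  HREF='two.html' >second</A>
<a
   href=three.html>third</a>
"""

link_start, link_end = makeHTMLTags("a")

# Calling an expression with a string is shorthand for setResultsName().
link = link_start + SkipTo(link_end)("text") + link_end

# Attribute names are lowercased by makeHTMLTags, so .href works for HREF too.
found = [(tokens.href, tokens.text) for tokens, start, end in link.scanString(sample)]
print(found)
```

A single expression picks up all three links and exposes each tag's attributes by name, which would be tedious to get right with a hand-written pattern for "<a href=...>".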

Try this process: print out htmlText on paper. Use a blue highlighter to mark the data that you want. Then use a yellow highlighter to mark the surrounding data or tags that will help locate that data. Now you have a template for building up your pyparsing expression to extract it. You've already started using results names (the 'body' definition in your parser); that's a good habit, keep it up. Give every expression for the blue text a results name, so that after the overall pattern has been matched, you can get at the individual bits by name.
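As a rough sketch of what the highlighter exercise might produce, suppose (purely as an assumption, since the real page's markup has to be inspected first) that each publication is a list item with the title in <b> and the venue in <i>. The tags and commas are the "yellow" landmarks; the named expressions are the "blue" data. The sample text and the names authors, title, venue, and year are invented for this example.

```python
from pyparsing import Literal, SkipTo, Suppress, Word, makeHTMLTags, nums

# Hypothetical markup; the real page will almost certainly differ.
sample = """
<ul>
<li>A. Author and B. Writer, <b>First made-up title</b>, <i>Journal of Examples</i>, 2008.</li>
<li>C. Scholar, <b>Second made-up title</b>, <i>Example Conference</i>, 2010.</li>
</ul>
"""

li_start, li_end = makeHTMLTags("li")
b_start, b_end = makeHTMLTags("b")
i_start, i_end = makeHTMLTags("i")
comma = Suppress(",")

entry = (
    li_start
    # authors: everything up to the comma that precedes the <b> tag
    + SkipTo(Literal(",") + b_start)("authors") + comma
    + b_start + SkipTo(b_end)("title") + b_end + comma
    + i_start + SkipTo(i_end)("venue") + i_end + comma
    + Word(nums, exact=4)("year")
)

records = [
    {"authors": t.authors, "title": t.title, "venue": t.venue, "year": t.year}
    for t, start, end in entry.scanString(sample)
]
for record in records:
    print(record)
```

Once the landmarks on the real page are known, the same shape applies: swap in the actual tags and separators, and keep a results name on each piece of data you want back.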

Good luck!

PaulMcG answered Dec 24 '25 12:12