Regex pattern in Python for parsing HTML title tags

I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the title of each website:

#!/usr/bin/python
# Python 2 script: prints the <title> text found on each page

import urllib
import re

urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]

these_regex = "<title>(.+?)</title>"
pattern = re.compile(these_regex)

for url in urls:
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)
    print titles

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

I found that this is because the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both so that I can take into account any additional parameters passed within the title tag?

rahuL asked Nov 18 '13 10:11



2 Answers

It is recommended that you use Beautiful Soup or another parser to parse HTML, but if you really want a regex, the following pattern would do the job.

The regex code:

<title.*?>(.+?)</title>

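How it works:

The .*? after <title lazily matches zero or more characters, so the pattern accepts both a bare <title> tag and a tag with attributes such as <title id="pageTitle">. The (.+?) group then captures the text up to the first </title>.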

Produces:

['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
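
To try the pattern without fetching any pages, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 script in the question), and the two HTML fragments and the extract_titles helper are made-up stand-ins for illustration, not the real pages:

```python
import re

# Pattern from this answer: ".*?" allows zero or more attribute characters.
TITLE_RE = re.compile(r"<title.*?>(.+?)</title>")

# Hypothetical page fragments standing in for downloaded HTML.
samples = [
    "<head><title>Google</title></head>",
    '<head><title id="pageTitle">Welcome to Facebook</title></head>',
]


def extract_titles(html):
    """Return every title captured by TITLE_RE in the given HTML string."""
    return TITLE_RE.findall(html)


results = [extract_titles(html) for html in samples]
for found in results:
    print(found)
```

Both fragments yield a one-element list, because the bare tag and the tag with an id attribute are matched by the same pattern.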
K DawG answered Oct 13 '22 22:10


You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

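import urllib2
from bs4 import BeautifulSoup

url = "https://facebook.com"  # for example, one of the URLs from the question
response = urllib2.urlopen(url)
# Use the charset declared in the Content-Type header as an encoding hint
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text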
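
If installing a third-party library is not an option, the standard library's html.parser module is another of the parsers Python offers. The following is only a rough sketch for Python 3, not part of the BeautifulSoup approach above; the TitleParser class, the get_title helper, and the sample HTML are all hypothetical names and data chosen for illustration:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def get_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Made-up sample document, not a real page.
sample = (
    '<html><head><title id="pageTitle">Welcome to Facebook</title></head>'
    "<body><p>Hello</p></body></html>"
)
print(get_title(sample))
```

Because the parser reports the tag name separately from its attributes, the id= attribute needs no special handling here.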

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.

Your specific problem can be solved by matching additional characters within the title tag, optionally:

r'<title[^>]*>([^<]+)</title>'

The [^>]* part matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag. The ([^<]+) group then captures the title text up to the next < character.
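
As a quick check, the sketch below runs the two patterns from the question and the pattern from this answer against a plain tag and a tag with an attribute. It is written for Python 3, and the two HTML strings and the match_table helper are invented for illustration rather than taken from the real pages:

```python
import re

# The two patterns tried in the question, plus the one suggested above.
patterns = {
    "question, original": r"<title>(.+?)</title>",
    "question, modified": r"<title.+?>(.+?)</title>",
    "this answer": r"<title[^>]*>([^<]+)</title>",
}

# Hypothetical fragments: one bare <title>, one with an id attribute.
pages = {
    "plain": "<title>Google</title>",
    "with id": '<title id="pageTitle">Welcome to Facebook</title>',
}


def match_table(patterns, pages):
    """Map each pattern name to the findall() result for every page."""
    return {
        name: {label: re.findall(regex, html) for label, html in pages.items()}
        for name, regex in patterns.items()
    }


table = match_table(patterns, pages)
for name, row in table.items():
    print(name, row)
```

On these fragments, each of the question's patterns matches only one of the two tag forms: .+? requires at least one character between <title and >, so the modified pattern cannot match the bare tag. The [^>]* version matches both.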

Martijn Pieters answered Oct 13 '22 22:10