Regex pattern in Python for parsing HTML title tags

Tags:

python

html

regex

web-scraping

I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
        htmlfile=urllib.urlopen(urls[i])
        htmltext=htmlfile.read()
        titles=re.findall(pattern,htmltext)
        print titles
        i+=1

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

I found that this is because, on Facebook's page, the title tag is written as follows: <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both patterns so that the regex matches the title tag whether or not it carries additional attributes?

asked Nov 18 '13 10:11

rahuL

2 Answers

It is recommended that you use Beautiful Soup or another parser to parse HTML, but if you really want to use a regex, the following pattern will do the job.

The regex code:

<title.*?>(.+?)</title>

How it works:

[Figure: regular expression visualization]

Produces:

['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
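
The difference from the attempt in the question is the quantifier: .+? needs at least one character between <title and >, so it skips a bare <title>, while .*? also accepts none. Here is a minimal sketch in Python 3 syntax that compares the two. The HTML snippets and the names pages, strict and relaxed are made up for illustration; nothing is fetched from the real sites.

```python
import re

# Illustrative HTML snippets standing in for downloaded pages (assumed, not real responses).
pages = {
    "plain": "<head><title>Google</title></head>",
    "with_id": '<head><title id="pageTitle">Welcome to Facebook</title></head>',
}

strict = re.compile(r"<title.+?>(.+?)</title>")   # needs at least one character before ">"
relaxed = re.compile(r"<title.*?>(.+?)</title>")  # also accepts nothing before ">"

strict_titles = {name: strict.findall(html) for name, html in pages.items()}
relaxed_titles = {name: relaxed.findall(html) for name, html in pages.items()}

print(strict_titles)   # the plain <title> is missed
print(relaxed_titles)  # both forms are matched
```

Note that . does not match newlines by default, so a title that spans several lines would still be missed unless re.DOTALL is passed.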

answered Oct 13 '22 22:10

K DawG

You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

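import urllib2  # Python 2 module; Python 3 moved urlopen to urllib.request

from bs4 import BeautifulSoup

# url is any page address, e.g. one entry of the urls list in the question
response = urllib2.urlopen(url)
# pass the charset from the HTTP headers so the bytes are decoded correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text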
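
If a third-party dependency is not an option, the standard library's html.parser module can probably do the same job. The following is only a sketch in Python 3 syntax: TitleParser and extract_title are hypothetical names, the input is assumed to be already decoded text rather than raw bytes, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes such as id= are simply ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    return "".join(parser.parts).strip() or None


print(extract_title('<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>'))
print(extract_title("<html><head><TITLE>Google</TITLE></head></html>"))
```

Because the parser works on tags rather than on raw characters, extra attributes and upper-case tag names need no special handling.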

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
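
To illustrate the nesting problem, here is a small sketch in Python 3 syntax with a made-up snippet. Neither quantifier pairs opening and closing tags: the lazy pattern cuts the outer element short at the first closing tag, and the greedy one only looks right here because the snippet contains a single outer element.

```python
import re

# Made-up snippet with one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

lazy = re.findall(r"<div>(.+?)</div>", html)   # stops at the first </div>
greedy = re.findall(r"<div>(.+)</div>", html)  # runs on to the last </div>

print(lazy)    # ['outer <div>inner']
print(greedy)  # ['outer <div>inner</div> tail']
```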

Your specific problem can be solved by matching additional characters within the title tag, optionally:

r'<title[^>]*>([^<]+)</title>'

The [^>]* part matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both a tag with extra attributes and the plain <title> tag. Likewise, ([^<]+) captures the title text up to the next < character.
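
As a quick check of that pattern, here is a sketch in Python 3 syntax. The sample strings are made up, TITLE_RE is just an illustrative name, and the re.IGNORECASE flag is my own addition so that <TITLE> is accepted too.

```python
import re

# Pattern from the answer; IGNORECASE is an added assumption, not part of the original.
TITLE_RE = re.compile(r"<title[^>]*>([^<]+)</title>", re.IGNORECASE)

# Made-up samples: plain tag, tag with an attribute, and a title spread over several lines.
samples = [
    "<head><title>Google</title></head>",
    '<head><title id="pageTitle">Welcome to Facebook</title></head>',
    "<head><title>\n  Example Domain\n</title></head>",
]

titles = [match.strip() for html in samples for match in TITLE_RE.findall(html)]
print(titles)  # ['Google', 'Welcome to Facebook', 'Example Domain']
```

Unlike ., the negated character classes also match newlines, so the multi-line title is captured as well; strip() removes the surrounding whitespace.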

answered Oct 13 '22 22:10

Martijn Pieters
