Scrape title by only downloading relevant part of webpage

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient, since the title is most often contained within the first few lines of HTML.

Is it possible to download only the parts of the webpage until the title has been found?

I've tried the following, but page.readline() downloads the entire page.

import urllib2  # Python 2

link = 'http://www.xyz.com/abc'  # placeholder; the real links come from my list of sites
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:  # also loops forever if the page has no </title>
    content = content + page.readline()

-- Edit --

Note that my current solution uses BeautifulSoup's SoupStrainer to parse only the title, so the only place left to optimize is probably to avoid reading the entire page in the first place.

from bs4 import BeautifulSoup, SoupStrainer

title_selector = SoupStrainer('title')  # restrict parsing to <title> elements only
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()

-- Edit 2 --

I've found that BeautifulSoup itself splits the content into multiple strings in its self.current_data variable (see this function in bs4), but I'm unsure how to modify the code so that it stops reading the remaining content once the title has been found. One complication is that redirects should still work.
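
To illustrate the kind of early exit I have in mind, here's a rough, untested sketch (Python 3 only, my idea rather than a working solution) that skips BeautifulSoup and feeds streamed chunks to the standard library's incremental HTMLParser, breaking out as soon as the title element closes; the URL is a placeholder:

from contextlib import closing
from html.parser import HTMLParser  # Python 3

import requests


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element, then flags completion."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True


parser = TitleParser()
# requests resolves redirects before streaming, so shortened links still work;
# decode_unicode=True assumes the response declares an encoding
with closing(requests.get("http://www.xyz.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=1024, decode_unicode=True):
        parser.feed(chunk)
        if parser.done:
            break  # stop downloading; the rest of the body is never fetched
print("".join(parser.parts).strip())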

-- Edit 3 --

So here's an example. I have a link www.xyz.com/abc and I have to follow it through any redirects (almost all of my links use bit.ly-style link shortening). I'm interested in both the title and the domain that remain after any redirections.
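
To make the redirect part concrete, here's a minimal sketch with requests, which follows redirects by default; www.xyz.com/abc stands in for a real shortened link:

import requests

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# HEAD resolves the redirect chain without downloading a response body
res = requests.head("http://www.xyz.com/abc", allow_redirects=True, timeout=10)
print(res.url)                    # final URL after all redirects
print(urlparse(res.url).netloc)   # final domain, e.g. "www.xyz.com"

The title itself still needs a GET, but that can be the streamed one shown in the accepted answer below.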

-- Edit 4 --

Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty open until it expires, though, to see whether a better answer comes up (e.g. one backed by a timing comparison).

-- Edit 5 --

For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.
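
For anyone wanting to reproduce the comparison, a minimal timing harness along these lines should do; scrape_streamed and scrape_bs4 are hypothetical stand-ins for the two implementations being compared:

import time

def best_time(scrape, urls, repeat=3):
    """Return the best wall-clock time for scraping every title in urls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.time()
        for url in urls:
            scrape(url)  # each scrape function fetches and returns the page title
        best = min(best, time.time() - start)
    return best

# urls = [...]  # the link list being scraped
# print(best_time(scrape_streamed, urls))  # accepted answer's streaming approach
# print(best_time(scrape_bs4, urls))       # BeautifulSoup4 baseline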

asked May 22 '17 by pir



2 Answers

You can defer downloading the entire response body by enabling stream mode of requests.

Requests 2.14.2 documentation - Advanced Usage

By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

...

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)

So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since redirects are handled by the library, you'll be ready to go.

Here's a quick, admittedly fragile example, tested with Python 2.7.10 and 3.6.0:

try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

import re
from contextlib import closing

import requests

CHUNKSIZE = 1024
retitle = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])          # accumulate chunks until the title is complete
        match = retitle.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)))  # decode HTML entities in the title
            break                                  # stop; the rest of the body is never downloaded
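
For reuse across thousands of links, the same logic can be packaged as a helper. A possible wrapper (Python 3; the timeout parameter, the None fallback, and the swap to html.unescape, since HTMLParser.unescape was removed in Python 3.9, are my additions, not part of the answer):

import re
from contextlib import closing
from html import unescape  # Python 3 replacement for HTMLParser.unescape

import requests

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch_title(url, chunk_size=1024, timeout=10):
    """Stream url and return its decoded <title> text, or None if not found."""
    buf = ""
    with closing(requests.get(url, stream=True, timeout=timeout)) as res:
        # res.url holds the post-redirect URL if the final domain is also needed
        for chunk in res.iter_content(chunk_size=chunk_size, decode_unicode=True):
            buf += chunk
            match = TITLE_RE.search(buf)
            if match:
                return unescape(match.group(1)).strip()
    return None

print(fetch_title("http://example.com/abc"))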
answered by Kul-Tigin


Question: ... the only place left to optimize is probably to avoid reading the entire page.

This does not read the entire page.

Note: decoding bytes with .decode() will raise a UnicodeDecodeError if a multi-byte sequence is cut in the middle of a chunk. Using .decode(errors='ignore') drops those incomplete sequences.

For instance:

import re

try:
    from urllib import request  # Python 3
except ImportError:
    import urllib2 as request   # Python 2

# Compile once; group 2 captures the text between <title ...> and </title>
re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)', re.DOTALL)

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    data = ''
    while True:
        b_data = f.read(4096)  # read the response 4 KiB at a time
        if not b_data:
            break

        data += b_data.decode(errors='ignore')  # drop bytes from split multi-byte chars
        match = re_obj.match(data)
        if match:
            title = match.groups()[1]
            print('title={}'.format(title))
            break  # title found; stop reading the rest of the body

    f.close()

Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform

Tested with Python: 3.4.2 and 2.7.9
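
To see why errors='ignore' matters here, consider a contrived Python 3 demonstration of a multi-byte UTF-8 character split across a chunk boundary:

data = u"h\u00e9llo".encode("utf-8")  # b'h\xc3\xa9llo'
chunk = data[:2]                      # cuts the two-byte '\xc3\xa9' in half

try:
    chunk.decode("utf-8")             # strict decode raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

print(chunk.decode("utf-8", errors="ignore"))  # prints 'h'; the broken bytes are dropped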

answered by stovfl