Scrape title by only downloading relevant part of webpage

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient, since the title is most often contained within the first few lines of HTML.

Is it possible to download only the parts of the webpage until the title has been found?

I've tried the following, but page.readline() downloads the entire page.

import urllib2  # Python 2

link = 'http://www.xyz.com/abc'  # placeholder; the real links come from my list of sites
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:  # also loops forever if the page has no </title>
    content = content + page.readline()

-- Edit --

Note that my current solution uses BeautifulSoup's SoupStrainer to parse only the title, so the only place left to optimize is probably to avoid reading the entire page in the first place.

from bs4 import BeautifulSoup, SoupStrainer

title_selector = SoupStrainer('title')  # restrict parsing to <title> elements only
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()

-- Edit 2 --

I've found that BeautifulSoup itself splits the content into multiple strings in its self.current_data variable (see this function in bs4), but I'm unsure how to modify the code so that it stops reading the remaining content once the title has been found. One complication is that redirects should still work.
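
To illustrate the kind of early exit I have in mind, here's a rough, untested sketch (Python 3 only, my idea rather than a working solution) that skips BeautifulSoup and feeds streamed chunks to the standard library's incremental HTMLParser, breaking out as soon as the title element closes; the URL is a placeholder:

from contextlib import closing
from html.parser import HTMLParser  # Python 3

import requests


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element, then flags completion."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True


parser = TitleParser()
# requests resolves redirects before streaming, so shortened links still work;
# decode_unicode=True assumes the response declares an encoding
with closing(requests.get("http://www.xyz.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=1024, decode_unicode=True):
        parser.feed(chunk)
        if parser.done:
            break  # stop downloading; the rest of the body is never fetched
print("".join(parser.parts).strip())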

-- Edit 3 --

So here's an example. I have a link www.xyz.com/abc and I have to follow it through any redirects (almost all of my links use bit.ly-style link shortening). I'm interested in both the title and the domain that remain after any redirections.
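
To make the redirect part concrete, here's a minimal sketch with requests, which follows redirects by default; www.xyz.com/abc stands in for a real shortened link:

import requests

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# HEAD resolves the redirect chain without downloading a response body
res = requests.head("http://www.xyz.com/abc", allow_redirects=True, timeout=10)
print(res.url)                    # final URL after all redirects
print(urlparse(res.url).netloc)   # final domain, e.g. "www.xyz.com"

The title itself still needs a GET, but that can be the streamed one shown in the accepted answer below.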

-- Edit 4 --

Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty open until it expires, though, to see whether a better answer comes up (e.g. one backed by a timing comparison).

-- Edit 5 --

For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.
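
For anyone wanting to reproduce the comparison, a minimal timing harness along these lines should do; scrape_streamed and scrape_bs4 are hypothetical stand-ins for the two implementations being compared:

import time

def best_time(scrape, urls, repeat=3):
    """Return the best wall-clock time for scraping every title in urls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.time()
        for url in urls:
            scrape(url)  # each scrape function fetches and returns the page title
        best = min(best, time.time() - start)
    return best

# urls = [...]  # the link list being scraped
# print(best_time(scrape_streamed, urls))  # accepted answer's streaming approach
# print(best_time(scrape_bs4, urls))       # BeautifulSoup4 baseline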

asked May 22 '17 by pir



2 Answers

You can defer downloading the entire response body by enabling stream mode of requests.

Requests 2.14.2 documentation - Advanced Usage

By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

...

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)

So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since redirects are handled by the library, you'll be ready to go.

Here's a quick, admittedly fragile example, tested with Python 2.7.10 and 3.6.0:

try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

import re
from contextlib import closing

import requests

CHUNKSIZE = 1024
retitle = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])          # accumulate chunks until the title is complete
        match = retitle.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)))  # decode HTML entities in the title
            break                                  # stop; the rest of the body is never downloaded
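
For reuse across thousands of links, the same logic can be packaged as a helper. A possible wrapper (Python 3; the timeout parameter, the None fallback, and the swap to html.unescape, since HTMLParser.unescape was removed in Python 3.9, are my additions, not part of the answer):

import re
from contextlib import closing
from html import unescape  # Python 3 replacement for HTMLParser.unescape

import requests

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch_title(url, chunk_size=1024, timeout=10):
    """Stream url and return its decoded <title> text, or None if not found."""
    buf = ""
    with closing(requests.get(url, stream=True, timeout=timeout)) as res:
        # res.url holds the post-redirect URL if the final domain is also needed
        for chunk in res.iter_content(chunk_size=chunk_size, decode_unicode=True):
            buf += chunk
            match = TITLE_RE.search(buf)
            if match:
                return unescape(match.group(1)).strip()
    return None

print(fetch_title("http://example.com/abc"))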
answered by Kul-Tigin


Question: ... the only place left to optimize is probably to avoid reading the entire page.

This does not read the entire page.

Note: decoding bytes with .decode() will raise a UnicodeDecodeError if a multi-byte sequence is cut in the middle of a chunk. Using .decode(errors='ignore') drops those incomplete sequences.

For instance:

import re

try:
    from urllib import request  # Python 3
except ImportError:
    import urllib2 as request   # Python 2

# Compile once; group 2 captures the text between <title ...> and </title>
re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)', re.DOTALL)

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    data = ''
    while True:
        b_data = f.read(4096)  # read the response 4 KiB at a time
        if not b_data:
            break

        data += b_data.decode(errors='ignore')  # drop bytes from split multi-byte chars
        match = re_obj.match(data)
        if match:
            title = match.groups()[1]
            print('title={}'.format(title))
            break  # title found; stop reading the rest of the body

    f.close()

Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform

Tested with Python: 3.4.2 and 2.7.9
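
To see why errors='ignore' matters here, consider a contrived Python 3 demonstration of a multi-byte UTF-8 character split across a chunk boundary:

data = u"h\u00e9llo".encode("utf-8")  # b'h\xc3\xa9llo'
chunk = data[:2]                      # cuts the two-byte '\xc3\xa9' in half

try:
    chunk.decode("utf-8")             # strict decode raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

print(chunk.decode("utf-8", errors="ignore"))  # prints 'h'; the broken bytes are dropped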

answered by stovfl