Using beautiful soup to clean up scraped HTML from scrapy

Question

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()

[
 u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&amp;rep=rep1&amp;type=pdf"><b>Python </b>Paradigms for XML</a>', 
 u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>', 
 u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>', 
 u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>', 
 u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>', 
 u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>', 
 u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>', 
 u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>', 
 u'<a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=El4TAgAAQBAJ&amp;oi=fnd&amp;pg=PT8&amp;dq=python+xpath&amp;ots=RrFv0f_Y6V&amp;sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>&amp; XML</a>', 
 u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF Clients</a>'
]

As you can see, this output is raw HTML that needs cleaning. I now have a good sense of how to clean this HTML up. The simplest way is probably to just use BeautifulSoup and try something like:

t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
soup = BeautifulSoup(t)
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)

This is based off an earlier SO question. The regexp version has been suggested, but I am guessing that BeautifulSoup will be more robust.
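A note on the snippet above: extract() returns a list of HTML fragments, whereas BeautifulSoup expects a single markup string (or a file-like object), so the cleaning step would have to be applied to each fragment in turn. The following is a minimal Python 3 sketch of that per-fragment idea. It deliberately swaps in the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages; the names TextOnly and strip_tags and the example.com links are made up for illustration. With bs4 installed, calling get_text() on a soup built from one fragment should play the same role.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text, including text nested inside <b>.
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text content of a single HTML fragment (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Two fragments shaped like the extract() output above (hrefs are placeholders).
fragments = [
    '<a href="http://example.com/1"><b>Python </b>Paradigms for XML</a>',
    '<a href="http://example.com/2"><b>Python </b>&amp; XML</a>',
]

# Clean each fragment separately instead of passing the whole list at once.
titles = [strip_tags(fragment) for fragment in fragments]
print(titles)
```

Each fragment yields exactly one title, so the list of titles stays aligned with the list of links.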

I'm a scrapy n00b and can't figure out how to embed this in my spider. I tried

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
        soup = BeautifulSoup(t)
        text_parts = soup.findAll(text=True)
        text = ''.join(text_parts)
        item['title'] = text
        return(item)

But that didn't quite work. Any suggestions would be helpful.

Edit 3: Based on suggestions, I have modified my spider file to:

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "dmoz"
    allowed_domains = ["sholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Anine+facts+about+top+journals+in+economics"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        titles = sel.xpath('//h3[@class="gs_rt"]/a')

        for title in titles:
            title = item.xpath('.//text()').extract()
            print "".join(title)

However, I get the following output:

2014-02-17 15:11:12-0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: scholarscrape)
2014-02-17 15:11:12-0800 [scrapy] INFO: Optional features available: ssl, http11
2014-02-17 15:11:12-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scholarscrape.spiders', 'SPIDER_MODULES': ['scholarscrape.spiders'], 'BOT_NAME': 'scholarscrape'}
2014-02-17 15:11:12-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider opened
2014-02-17 15:11:13-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-17 15:11:13-0800 [dmoz] DEBUG: Crawled (200) <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml> (referer: None)
2014-02-17 15:11:13-0800 [dmoz] ERROR: Spider error processing <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml>
 Traceback (most recent call last):
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
     self.runUntilCurrent()
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
     self._startRunCallbacks(result)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/Users/krishnan/work/research/journals/code/scholarscrape/scholarscrape/spiders/scholar_spider.py", line 20, in parse
     title = item.xpath('.//text()').extract()
   File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__
     raise AttributeError(name)
 exceptions.AttributeError: xpath

2014-02-17 15:11:13-0800 [dmoz] INFO: Closing spider (finished)
2014-02-17 15:11:13-0800 [dmoz] INFO: Dumping Scrapy stats:
 {'downloader/request_bytes': 247,
  'downloader/request_count': 1,
  'downloader/request_method_count/GET': 1,
  'downloader/response_bytes': 108851,
  'downloader/response_count': 1,
  'downloader/response_status_count/200': 1,
  'finish_reason': 'finished',
  'finish_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 196648),
  'log_count/DEBUG': 3,
  'log_count/ERROR': 1,
  'log_count/INFO': 7,
  'response_received_count': 1,
  'scheduler/dequeued': 1,
  'scheduler/dequeued/memory': 1,
  'scheduler/enqueued': 1,
  'scheduler/enqueued/memory': 1,
  'spider_exceptions/AttributeError': 1,
  'start_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 21701)}
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider closed (finished)

Edit 2: My original question was quite different, but I am now convinced that this is the right way to proceed. Original question (and first edit below):

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link:

http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('string(//h3[@class="gs_rt"]/a)').extract()
[u'Python Paradigms for XML']

As you can see, this only selects the first title, and none of the others on the page. I can't figure out what I should modify my XPath to, so that I select all such elements on the page. Any help is greatly appreciated.

Edit 1: My first approach was to try

>>> sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
[u'Paradigms for XML', u'NCClient: A ', u'Library for NETCONF Clients', 
 u'PALSE: ', u'Analysis of Large Scale (Computer) Experiments', u'and XML', 
 u'drx: ', u'Programming Language [Computers: Programming: Languages: ',
 u']-loadaverageZero', u'XML and ', u'Tutorial', 
 u'Zato\u2014agile ESB, SOA, REST and cloud integrations in ', 
 u'XML Processing with Perl, ', u', and PHP', u'& XML', u'A ', 
 u'Module for NETCONF Clients']

The problem with this approach is that if you look at the actual Google Scholar page, you will see that the first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as scrapy returns. My guess is that 'Python' is trapped inside <b> tags, which is why text() is not doing what we want it to do.

Pawel Miech · Accepted Answer

This is a really interesting and rather difficult question. The problem is that "Python" in the title is in bold, so it sits inside a <b> element node, while the rest of the title is a plain text node directly under the link. The text() step in your XPath selects only the direct child text nodes of <a>, so it misses the text inside <b>.
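The difference between direct text nodes and nested text can be seen on a single link. This is a small Python 3 sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, purely so it runs without Scrapy installed; the example.com href is a placeholder and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

link = ET.fromstring(
    '<a href="http://example.com/1"><b>Python </b>Paradigms for XML</a>'
)

# Direct text nodes of <a>: its own leading text plus the text that follows
# each child element. This mirrors what a/text() selects.
direct_text = [t for t in [link.text] + [child.tail for child in link] if t]

# Every text node under <a>, nested ones included, in document order.
# This mirrors what .//text() selects relative to the link.
all_text = list(link.itertext())

print(direct_text)  # ['Paradigms for XML']
print(all_text)     # ['Python ', 'Paradigms for XML']
```

The word "Python" only shows up in the second list because it belongs to the <b> child rather than to the link itself.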

Here's my solution. First get all the links:

titles = sel.xpath('//h3[@class="gs_rt"]/a')

Then iterate over them and select all the text nodes under each link, including those nested inside <b>, and join them into one string per link:

for item in titles:
    title = item.xpath('.//text()').extract()
    print "".join(title)

This works because inside the loop each XPath is evaluated relative to a single link, so './/text()' returns only that link's text nodes, in document order, and joining them gives the full title. For instance, title in the loop will be [u'Python ', u'Paradigms for XML'] or [u'NCClient: A ', u'Python ', u'Library for NETCONF Clients'].
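To see the per-link loop end to end without a live request, here is a Python 3 sketch on a tiny hand-written page. It again uses xml.etree.ElementTree in place of Scrapy's selector, and the markup, the example.com links, and the "other" class are invented for the example; only the h3 class gs_rt and the two titles come from the question.

```python
import xml.etree.ElementTree as ET

# A cut-down, well-formed imitation of the results page (assumed structure).
page = ET.fromstring("""
<div>
  <h3 class="gs_rt"><a href="http://example.com/1"><b>Python </b>Paradigms for XML</a></h3>
  <h3 class="gs_rt"><a href="http://example.com/2">NCClient: A <b>Python </b>Library for NETCONF Clients</a></h3>
  <h3 class="other"><a href="http://example.com/3">Not a result title</a></h3>
</div>
""")

titles = []
# Select each result link first, then gather the text under that link only.
for link in page.findall(".//h3[@class='gs_rt']/a"):
    titles.append("".join(link.itertext()))

print(titles)
```

The same reasoning applies to the spider in the question's Edit 3. There the loop variable is title, while item is the ScholarscrapeItem instance, so the xpath call inside the loop would need to be made on the loop variable (a selector) rather than on item. That mismatch is what the AttributeError: xpath in the traceback points to.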

Tags: xpath, scrapy

Asked by krishnan
