 

Crawl and Concatenate in Scrapy

I'm trying to crawl a movie list with Scrapy (I only take the Director and Movie title fields). Sometimes a movie has two directors, and Scrapy scrapes them as separate entries: the first director appears along with the movie title, but the second one has no movie title.

So I added a condition like this:

if director2:
            item['director'] = map(unicode.strip,titres.xpath("tbody/tr/td/div/div[2]/div[3]/div[2]/div/h2/div/a/text()").extract())

The last div[2] exists only if there are two directors.

I want to concatenate the names like this: director1, director2

Here is my full code:

class movies(scrapy.Spider):
name ="movielist"
allowed_domains = ["domain.com"]
start_urls = ["http://www.domain.com/list"]

def parse(self, response):
    for titles in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " grid")]'):
        item = MovieItem()
        director2 = Selector(text=html_content).xpath("h2/div[2]/a/text()")
        if director2:
            item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
        else:
            item['director'] = map(unicode.strip,titres.xpath("h2/div/a/text()").extract())
            item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
            item['title'] = map(unicode.strip,titres.xpath("h2/a/text()").extract())
        yield item

Sample HTML with one director:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>

Sometimes there are two directors:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>

Can you tell me what's wrong with my syntax?

I tested without the condition and without the concatenation, and it works very well.

This is my first time with Python so please be indulgent.

Thank you very much.

cyclone200 asked Oct 18 '25 14:10


1 Answer

Get all the directors (1, 2 or more) and join them with join():

item['director'] = ", ".join(titles.xpath("h2/div/a/text()").extract())
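To see why a single relative path covers both cases, here is a minimal sketch run against the two HTML samples from the question. It assumes Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, so it runs without a crawl; the helper name `join_directors` and the two constants are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two samples from the question, copied as-is.
ONE_DIRECTOR = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>
"""

TWO_DIRECTORS = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>
"""


def join_directors(h2_markup):
    """Return every director name under the <h2>, comma-separated."""
    h2 = ET.fromstring(h2_markup)
    # "div/a" matches an <a> inside any direct <div> child, so the
    # "Info" <div> (which has no link) contributes nothing.
    names = [a.text.strip() for a in h2.findall("div/a")]
    return ", ".join(names)


print(join_directors(ONE_DIRECTOR))
print(join_directors(TWO_DIRECTORS))
```

Because the path returns a list with one entry per director, no `if director2:` branch is needed: joining a one-element list simply gives back that single name.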

A better Scrapy-specific approach though would be to use an ItemLoader and Join() processor. Define an ItemLoader:

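# On recent Scrapy releases these live in scrapy.loader and itemloaders.processors.
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join

class MovieLoader(ItemLoader):

    # Fields without their own output processor keep only the first value.
    default_output_processor = TakeFirst()

    # Strip each extracted name (use str.strip on Python 3).
    director_in = MapCompose(unicode.strip)
    # Join() separates with a space by default; pass ", " for "director1, director2".
    director_out = Join(", ")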

And let it worry about stripping and joining:

loader = MovieLoader(MovieItem(), titles)
...
loader.add_xpath("director", "h2/div/a/text()")
loader.add_xpath("title", "h2/a/text()")
yield loader.load_item()
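If the processors feel like magic, the following plain-Python sketch mimics what they do to the extracted values. It is a simplified imitation, not Scrapy's implementation (the real `MapCompose` also drops `None` results and flattens nested values), and the names `map_compose`, `join` and `take_first` are hypothetical. It assumes Python 3, hence `str.strip`. Scrapy's `Join` separates with a single space by default, so the sketch passes `", "` to get the `director1, director2` form asked for in the question.

```python
def map_compose(*functions):
    """Build an input processor that runs each value through the functions in order."""
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
        return values
    return process


def join(separator=" "):
    """Build an output processor that concatenates all values into one string."""
    def process(values):
        return separator.join(values)
    return process


def take_first(values):
    """Output processor that returns the first non-empty value, or None."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Raw strings as an XPath text() query might return them, whitespace included.
raw_directors = ["  Director's name ", "Second director's name\n"]
raw_titles = ["  Movie's title  "]

director_in = map_compose(str.strip)
director_out = join(", ")

director = director_out(director_in(raw_directors))
title = take_first(map_compose(str.strip)(raw_titles))

print(director)
print(title)
```

The loader keeps this cleanup in one place, so the spider's `parse` method only lists which XPath feeds which field.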
alecxe answered Oct 20 '25 03:10


