Scrapy

Question

I am able to retrieve the text before the <br> tag but not the text after it.

This is the website that I am trying to scrape the comments from: http://hamusoku.com/archives/9589071.html#comments

Some comments include the <br> tag, which I think means that the user hit enter. Is there a way to get the text before and after the <br> tag as a single comment?

Here is a sample of the source code

<li class="comment-body">
    "
    愛の言葉も、この瞬間は辛い。"
    <br>
    "
    胸が締め付けられそうだ。"

This is my code:

import scrapy


class HamusoSpider(scrapy.Spider):
    name = 'hamuso'
    start_urls = ['http://hamusoku.com/archives/9589071.html#comments/']

    def parse(self, response):
        for com in response.css('li.comment-body'):
            item = {
                'comment': com.css('li::text').extract_first()
            }
            yield item

This is the output that I am getting in the shell:

{'comment': '
	
	かなしいなぁ'}
{'comment': '
	
	海老蔵…つらいな'}
{'comment': '
	
	海老蔵には頑張って欲しいな'}
{'comment': '
	
	御冥福をお祈りします'}
{'comment': '
	
	泣かすなや。'}
{'comment': '
	
	海老蔵これからしっかりせなアカンぞ'}
{'comment': '
	
	愛の言葉も、この瞬間は辛い。'}
{'comment': '
	
	ただただ涙が止まらない会見だった'}

The last two comments both have a <br> tag, and in both cases the second part of the comment is omitted.

I would really really appreciate any help with this.

Boswell Gathu · Accepted Answer

I ran your spider and realised that the <br> tag splits a comment into separate text nodes. extract_first() returns only the first of them, so the text after the <br> tag is never reached.
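To see why the comment comes back in pieces, here is a small illustration using only Python's standard html.parser module rather than Scrapy itself. The TextNodeCollector class and the simplified HTML string are made up for this example; the point is only that a <br> in the middle of a comment produces two separate runs of text.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: store each run of text between tags separately."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Called once for every run of text found between two tags
        self.text_nodes.append(data)


# Simplified version of one comment from the question (an assumption,
# not the exact markup of the real page)
html = ('<li class="comment-body">\n\t愛の言葉も、この瞬間は辛い。'
        '<br>\n胸が締め付けられそうだ。</li>')

collector = TextNodeCollector()
collector.feed(html)
collector.close()

# The <br> splits the comment into two text nodes
print(len(collector.text_nodes))
for node in collector.text_nodes:
    print(repr(node.strip()))
```

Taking only the first entry of that list is roughly what extract_first() does with li::text, which is why the second half of the comment goes missing.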

To solve this, use extract() instead. It returns a list of all the text nodes in the comment-body:

import scrapy

class HamusoSpider(scrapy.Spider):
    name = 'hamuso'
    start_urls = ['http://hamusoku.com/archives/9589071.html#comments/']
    def parse(self, response):
        for com in response.css('li.comment-body'):
            item = {'comment': com.css('li::text').extract()}
            yield item

The output I get for the last two comments in your output is:

{'comment': ['
	
	ただただ涙が止まらない会見だった', '
本当に短い人生だったけど豊かな人生だったのがわかる']}
{'comment': ['
	
	愛の言葉も、この瞬間は辛い。', '
胸が締め付けられそうだ。']}
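If you want each comment as a single string, as the question asks, one option is to strip the whitespace from each piece and join them. This is only a sketch: join_text_nodes is a hypothetical helper name, and joining with a newline (to stand in for the <br>) is an assumption you may want to change to a space or an empty string.

```python
def join_text_nodes(parts, sep='\n'):
    """Strip whitespace from each text node and join the non-empty ones."""
    cleaned = (part.strip() for part in parts)
    return sep.join(piece for piece in cleaned if piece)


# Text nodes in the shape returned by com.css('li::text').extract()
parts = ['\n\t\n\t愛の言葉も、この瞬間は辛い。', '\n胸が締め付けられそうだ。']
comment = join_text_nodes(parts)
print(comment)
```

In the spider, the item line would then become item = {'comment': join_text_nodes(com.css('li::text').extract())}.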

Scrapy - scraping comments skips the text after <br>

Tags:

python-3.x

Jake Olesniewicz
