
Scraping text without javascript code using scrapy

I'm currently setting up a bunch of spiders using scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc) from the target sites.

The problem is that my target node sometimes contains a <script> tag, so the scraped text includes JavaScript code.

Here is a link to a real example of what I'm working with. In this case my target node is //td[@id='contenuStory']. The problem is that there's a <script> tag in the first child div.

I've spent a lot of time searching for a solution on the web and on SO, but I couldn't find anything. I hope I haven't missed something obvious!

Example

HTML response (only the target node):

<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>

What I want in my item:

Some text
Some other text

What I get:

Some text
var s = 'javascript I don't want';
Some other text

My code

Given an XPath selector, I'm using the following function to extract the text:

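def getText(hxs):
    # hxs is the selector list returned by an XPath query
    if len(hxs) > 0:
        # string(.) concatenates every descendant text node,
        # including the contents of <script> elements
        l = hxs.select('string(.)')
        if len(l) > 0:
            s = l[0].extract().encode('utf-8')
        else:
            # fall back to the raw markup of the first node
            s = hxs[0].extract().encode('utf-8')
        return s
    else:
        # nothing matched
        return 0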

I've tried using XPath axes (things like child::script) but to no avail.

asked Nov 04 '13 18:11 by n6g7


3 Answers

Try the utility functions from w3lib.html:

from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings; the w3lib helpers expect a single string
html = hxs.select('//div[@id="content"]').extract()[0]
output = remove_tags(remove_tags_with_content(html, ('script', )))
answered Oct 18 '22 18:10 by kev


You can append the predicate [not(ancestor-or-self::script)] to the text() step of your XPath expression.

This skips any text inside a script element. You can extend it to exclude other elements as well: [not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)] skips the contents of script, noscript, and style elements, none of which are part of the visible text.

Example:

//article//p//text()[not (ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]
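A rough way to try this predicate outside a spider is with lxml directly (Scrapy's selectors are built on lxml, so the XPath behaves the same way). The markup below is hypothetical: it is the question's sample with a noscript and a style element added, and the `doc`, `query`, and `fragments` names are only for illustration.

```python
from lxml import html

# Hypothetical markup: the question's sample, extended with <noscript> and <style>.
doc = html.document_fromstring(u"""<html><body>
<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I do not want';</script>
    <noscript>Please enable JavaScript</noscript>
    <style>#part2 { color: red; }</style>
    <div id="part2">Some other text</div>
</div>
</body></html>""")

# Select every text node under the target, except those that have a
# script, noscript, or style element among their ancestors.
query = (
    '//div[@id="content"]//text()'
    '[not(ancestor-or-self::script or ancestor-or-self::noscript'
    ' or ancestor-or-self::style)]'
)

# Keep non-blank text nodes only, trimmed of surrounding whitespace.
fragments = [t.strip() for t in doc.xpath(query) if t.strip()]
print(fragments)
```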
answered Oct 18 '22 18:10 by Jorge Parreira


You can try this XPath expression:

hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()

i.e., the text nodes that are direct children of //td[@id='contenuStory'] or of any of its descendant elements, skipping script elements.

To put a space between the text nodes, you can use something like:

u' '.join(
    hxs.select(
        '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
)
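The same idea can be wrapped in a small helper. The sketch below uses lxml directly so it runs without a crawl; the `text_without_scripts` name and the sample markup are made up for illustration, and the whitespace collapsing is an extra step that the expression above does not perform on its own.

```python
from lxml import html


def text_without_scripts(node):
    """Return the text under `node`, skipping text that sits directly inside <script>."""
    parts = node.xpath('descendant-or-self::*[not(self::script)]/text()')
    # Join the text nodes with spaces, then collapse runs of whitespace.
    return u' '.join(u' '.join(parts).split())


# Hypothetical markup with a <script> in the first child div, as in the question.
doc = html.document_fromstring(u"""<html><body><table><tr>
<td id="contenuStory">
    <div><script>var s = 'unwanted';</script>First paragraph.</div>
    <p>Second paragraph.</p>
</td>
</tr></table></body></html>""")

target = doc.xpath('//td[@id="contenuStory"]')[0]
print(text_without_scripts(target))
```

Note that joining with a space also inserts a space wherever inline markup splits a sentence into several text nodes, so the separator may need adjusting for pages with a lot of inline formatting.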
answered Oct 18 '22 17:10 by paul trmbrth