
Scrapy Body Text Only

I am trying to scrape only the text from the body of a page using Python and Scrapy, but haven't had any luck yet.

I'm hoping someone can help me scrape all the text inside the <body> tag.

mmrs151 asked Mar 22 '11 10:03





1 Answer

Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:

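x.select("//body").extract()    # extract body (markup included); x is a selector built from the response. Newer Scrapy versions spell this response.xpath("//body").extract()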

You can find more information about the selectors Scrapy provides in the Selectors section of the Scrapy documentation.

Eli Bendersky answered Oct 06 '22 23:10
