I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.
I'm hoping someone here can help me scrape all the text from the <body> tag.
Some example XPath expressions: /html/head/title selects the <title> element inside the <head> element of an HTML document. /html/head/title/text() selects the text within that same <title> element. //td selects all <td> elements in the document.
When you need the text of a node inside an XPath string function, pass . (dot) rather than .//text(). The expression .//text() produces a collection of text nodes (a node-set), and a string function converts a node-set by using only its first node.
Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) code for concurrency. Scrapy does not execute JavaScript by itself, so to scrape a dynamic website whose content is rendered in the browser you typically need Selenium or another WebDriver-based tool.
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument. Example: def parse_page1(self, response): return scrapy.
Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:
response.xpath("//body").get()  # extract the <body> element, markup included (.select() is the old selector API)
You can find more information about the selectors Scrapy provides in the selectors section of the Scrapy documentation.