Suppose I have some HTML fragments like:
<a>
text in a
<b>text in b</b>
<c>text in c</c>
</a>
<a>
<b>text in b</b>
text in a
<c>text in c</c>
</a>
I want to extract all the text inside each <a> tag, including the text of its child tags but without the tags themselves. For the fragments above, the results should be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes with the Scrapy Selector css() function, but how should I process those nodes to get what I want? Any ideas would be appreciated, thank you!
Here's what I managed to do:
from scrapy.selector import Selector

sel = Selector(text=html_string)
# 'a *::text' selects every text node below an <a> element
for node in sel.css('a *::text'):
    print(node.extract())
Assuming that html_string is a variable holding the HTML in your question, this code produces the following output:
text in a
text in b
text in c
text in b
text in a
text in c
The selector a *::text matches all the text nodes which are descendants of a nodes.
Try this:
response.xpath('//a/node()').extract()
You can use XPath's string() function on the elements you select:
$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
... text in a
... <b>text in b</b>
... <c>text in c</c>
... </a>
... <a>
... <b>text in b</b>
... text in a
... <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
...
[u'\n text in a\n text in b\n text in c\n']
[u'\n text in b\n text in a\n text in c\n']
>>>
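The strings returned by string(.) still contain the newlines and indentation from the source HTML. One simple way to get the single-spaced form asked for in the question is to collapse the whitespace in plain Python afterwards. This is only a sketch: the helper name collapse_whitespace is made up here, and the input strings are taken from the output above. XPath 1.0 also has a normalize-space() function that should do the same collapsing on the selector side.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so a plain join is enough.
    return ' '.join(text.split())


# Strings as returned by string(.) in the session above
extracted = [
    '\n text in a\n text in b\n text in c\n',
    '\n text in b\n text in a\n text in c\n',
]

cleaned = [collapse_whitespace(s) for s in extracted]
print(cleaned)
```

Doing the cleanup in Python keeps the XPath expression unchanged and works on any string, whichever selector produced it.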