Scrapy/XPath extract text from different tags (<p> OR <li>)

Question

I'm trying to scrape text from a website. Sometimes, the text is written in bullet points, sometimes just in plain text.

Text in Bullet points (XPath): /article/div[@class='border-bottom-grau'][1]/ul/li[1]

Text in Plain text (XPath): /article/div[@class='border-bottom-grau'][1]/p

I need the respective text to be extracted (without the div/ul/li/p tags etc.). This is what I have tried so far:

info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1][descendant::text()]").extract()

Output: see image Output

I also experimented with descendant-or-self and with a trailing /text(), but neither worked. I simply want to extract all the text, whether it is in bullet points or plain text. Bullet points should just be concatenated, maybe separated by ";" or ",".

Any help is much appreciated

Thanks

JaSON · Accepted Answer

You can combine both paths into a single XPath expression with the union operator |:

"/article/div[@class='border-bottom-grau'][1]/ul/li[1] | /article/div[@class='border-bottom-grau'][1]/p"

Forensic_07 · Answer

The union operator | mentioned in the other answer is a good solution. Alternatively, depending on your output needs, you might select every p or li descendant of the first matching div:

/article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]

Tags: python, html, extract, xpath, mixed

Julian
