Scrapy/XPath extract text from different tags (<p> OR <li>)

I'm trying to scrape text from a website. Sometimes, the text is written in bullet points, sometimes just in plain text.

Text in Bullet points (XPath): /article/div[@class='border-bottom-grau'][1]/ul/li[1]

Text in Plain text (XPath): /article/div[@class='border-bottom-grau'][1]/p

I need the respective text to be extracted (without the div/ul/li/p tags, etc.). This is what I have tried so far:

info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1][descendant::text()]").extract()

I also experimented with descendant-or-self and with a /text() at the end, but neither worked. Put simply, I want to extract all of the text, whether it is in bullet points or plain text. Bullet points should just be joined, maybe with a ";" or ",".

Any help is much appreciated

Thanks

Julian asked Dec 30 '25 08:12


2 Answers

You can combine both paths in a single XPath expression with the union operator (|):

"/article/div[@class='border-bottom-grau'][1]/ul/li[1] | /article/div[@class='border-bottom-grau'][1]/p"
JaSON answered Jan 01 '26 01:01


The union operator | mentioned in the other answer is a good solution. Alternatively, depending on your output needs, you might try:

/article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]

Forensic_07 answered Jan 01 '26 01:01


