I'm trying to scrape text from a website. Sometimes, the text is written in bullet points, sometimes just in plain text.
Text in bullet points (XPath): /article/div[@class='border-bottom-grau'][1]/ul/li[1]
Text in plain text (XPath): /article/div[@class='border-bottom-grau'][1]/p
I need the respective text to be extracted (without the div/ul/li/p tags etc.). This is what I have tried so far:
info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1][descendant::text()]").extract()
Output: see image.
I also experimented with descendant-or-self and with a /text() at the end, but neither worked. Put simply, I want to extract all the text, whether it is in bullet points or in plain text. Bullet points should just be joined together, maybe with a ";" or ",".
Any help is much appreciated
Thanks
You can combine both paths in a single XPath expression using the union operator |:
"/article/div[@class='border-bottom-grau'][1]/ul/li[1] | /article/div[@class='border-bottom-grau'][1]/p"
The union operator | mentioned in the other answer is a good solution. Alternatively, depending on your output needs, you might try selecting every p or li element below the div, which returns them in document order:
/article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]
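To get the joined text the question asks for, the matched p and li elements still need to be reduced to their text and concatenated. Below is a minimal sketch of that step. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree as a stand-in so it runs on its own, and because ElementTree does not support `self::` or `|`, the p/li filter is done in Python instead of in the XPath. The sample markup and the helper name `extract_block_text` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the structure described in the question.
SAMPLE = """
<article>
  <div class="border-bottom-grau">
    <p>Plain intro text</p>
    <ul>
      <li>First <b>task</b></li>
      <li>Second task</li>
    </ul>
  </div>
  <div class="border-bottom-grau">
    <p>Another section</p>
  </div>
</article>
"""


def extract_block_text(article, sep="; "):
    """Join the text of all <p> and <li> elements in the first matching div."""
    # find() returns the first matching child, which plays the role of the [1] predicate.
    block = article.find("div[@class='border-bottom-grau']")
    if block is None:
        return ""
    parts = []
    # iter() walks the subtree in document order, like //*[self::p or self::li].
    for el in block.iter():
        if el.tag in ("p", "li"):
            # itertext() includes text of nested tags such as <b>;
            # split/join collapses runs of whitespace.
            text = " ".join("".join(el.itertext()).split())
            if text:
                parts.append(text)
    return sep.join(parts)


article = ET.fromstring(SAMPLE.strip())
result = extract_block_text(article)
print(result)  # Plain intro text; First task; Second task
```

In Scrapy itself, the same idea would presumably be to run the XPath above on the selector, take the string value of each match (for example with a nested `.xpath("string(.)")`), and join the resulting list with "; ". Note that if a p is nested inside an li, its text would appear twice with this approach.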