I'm trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but .string is returning None. The HTML I'm trying to parse is:
<div class="booker-booking">
2 rooms
·
USD 0
<!-- Commission: USD -->
</div>
The Python snippet I have is:
data = soup.find('div', class_='booker-booking').string
I've also tried the following two:
data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]
Which both return:
u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n
I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".
.string returns None because the div has more than one child: besides the text node there is a comment (and a newline after it), so there is no single string to return.
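If you want to check this yourself, here is a minimal sketch that lists the children of the div. The html string is my reconstruction of the markup from the question (I'm assuming the spaces inside "2 rooms" and "USD 0" are non-breaking, as your output suggests), and html.parser is picked only because it ships with Python.

```python
from bs4 import BeautifulSoup, Comment

# Reconstructed from the question; the exact whitespace is an assumption.
html = (u'<div class="booker-booking">\n'
        u'2\xa0rooms\n'
        u'\xb7\n'
        u'USD\xa00\n'
        u'<!-- Commission: USD -->\n'
        u'</div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="booker-booking")

# The div holds several children, one of which is a Comment node.
child_count = len(div.contents)
has_comment = any(isinstance(child, Comment) for child in div.contents)

# With more than one child, .string has nothing unambiguous to return.
string_value = div.string
```

Here child_count is greater than 1, has_comment is True, and string_value is None.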
from bs4 import BeautifulSoup, Comment
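# html holds the markup from the question; naming a parser keeps the result consistent
soup = BeautifulSoup(html, "html.parser")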
div = soup.find('div', 'booker-booking')
# join all text nodes, skipping comments
text = " ".join(div.find_all(text=lambda t: not isinstance(t, Comment)))
# -> u'\n 2\xa0rooms\n \xb7\n USD\xa00\n \n'
To remove Unicode whitespace:
text = " ".join(text.split())
# -> u'2 rooms \xb7 USD 0'
print text
# -> 2 rooms · USD 0
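The reason a bare split() is enough here is that, on a unicode string, it splits on any run of whitespace, and that includes the non-breaking space u'\xa0'. Splitting on a literal space does not. A small sketch to illustrate the difference (the sample string is just a fragment of the output above, and the variable names are mine):

```python
sample = u"2\xa0rooms \n\t\tUSD\xa00"

# No argument: every whitespace run, including \xa0, \n and \t, is a separator.
by_whitespace = sample.split()

# Literal space: only u" " separates, so \xa0, \n and \t stay inside the pieces.
by_space = sample.split(u" ")

# Re-joining the whitespace-split pieces gives a clean single-spaced string.
normalized = u" ".join(by_whitespace)
```

This gives by_whitespace == [u'2', u'rooms', u'USD', u'0'] and normalized == u'2 rooms USD 0', while by_space still contains u'2\xa0rooms' as one piece. Note that this relies on the text being unicode, which is what BeautifulSoup hands back.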
To get your final variables:
var1, var2 = [s.strip() for s in text.split(u"\xb7")]
# -> u'2 rooms', u'USD 0'