I'm trying to parse the contents of an Evernote checklist using BeautifulSoup. But when I call the HTML parser on the contents, it keeps correcting the self-closing tags (en-todo), so when I try to get the text of the en-todo tags, it's blank.
note_body = '<en-todo checked="true" />window caulk<en-todo />cake pan<en-todo />cake mix<en-todo />salad mix<en-todo checked="true"/>painters tape<br />'
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(note_body, 'html.parser')
checklist_items = soup.find_all('en-todo')
print(checklist_items)
The above code returns just the tags, without any of the text.
[<en-todo checked="true"></en-todo>, <en-todo></en-todo>, <en-todo></en-todo>, <en-todo></en-todo>, <en-todo checked="true"></en-todo>]
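The text you want isn't enclosed in the en-todo tags. Each <en-todo /> is an empty element, and the item text is the text node that comes right after it in the document.
Use tag.next_sibling to get that following node: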
>>> [each.next_sibling for each in checklist_items]
[u'window caulk', u'cake pan', u'cake mix', u'salad mix', u'painters tape']
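If you also want to know which items are ticked, you can read the checked attribute at the same time. The following is only a sketch, written for Python 3 with bs4 installed; parse_checklist is a name I made up for illustration, and it assumes each item's text directly follows its en-todo tag, as in the sample note_body from the question.

```python
from bs4 import BeautifulSoup, NavigableString

note_body = ('<en-todo checked="true" />window caulk<en-todo />cake pan'
             '<en-todo />cake mix<en-todo />salad mix'
             '<en-todo checked="true"/>painters tape<br />')


def parse_checklist(body):
    """Return (text, checked) pairs for each en-todo item (hypothetical helper)."""
    soup = BeautifulSoup(body, 'html.parser')
    items = []
    for todo in soup.find_all('en-todo'):
        sibling = todo.next_sibling
        # Only use the sibling if it is a text node; another tag
        # (or nothing at all) means the item has no text of its own.
        text = sibling.strip() if isinstance(sibling, NavigableString) else ''
        # A missing checked attribute is treated as unchecked.
        checked = todo.get('checked') == 'true'
        items.append((text, checked))
    return items


for text, checked in parse_checklist(note_body):
    print('[x]' if checked else '[ ]', text)
```

The isinstance check is there because next_sibling can be another tag or None when an en-todo has no text after it, so the sketch falls back to an empty string rather than failing.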