
Unable to extract text in the immediate level using BeautifulSoup

I followed the method from another answer, find(text=True, recursive=False), to extract the text at the immediate level of a tag. For some markup, such as u'<p>\n <strong>\n Established\n </strong>\n 1865\n</p>\n', it does not return the text I expect.

Here's the code:

from bs4 import BeautifulSoup

markup = u'<p>\n <strong>\n  Established\n </strong>\n 1865\n</p>\n'
s = BeautifulSoup(markup, 'lxml')
print(s.find('p').find(text=True, recursive=False))

And it prints

45: u'\n'

It works if I strip all the newlines (\n) from the markup, but I don't think it's a good idea to blindly strip every newline from the whole HTML file.

Is there another solution?

Devi Prasad Khatua asked Nov 20 '25 08:11


1 Answer

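find returns only the first match, which here is the whitespace-only text node that sits between <p> and <strong>. To get every text node at the immediate level, use find_all: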

print(s.find('p').find_all(text=True, recursive=False))
['\n', '\n 1865\n']
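To see why the first match is only whitespace, it can help to list the direct children of the <p> tag. The following is a minimal sketch, not part of the original answer; it assumes the built-in 'html.parser' backend instead of 'lxml' so that it runs without extra dependencies, and the names soup, p and children are illustrative only.

```python
from bs4 import BeautifulSoup

markup = u'<p>\n <strong>\n  Established\n </strong>\n 1865\n</p>\n'
# Assumption: 'html.parser' is used here only to avoid requiring lxml.
soup = BeautifulSoup(markup, 'html.parser')
p = soup.find('p')

# Record the type name and the text of each direct child of <p>.
children = [(type(child).__name__, str(child)) for child in p.children]
for kind, text in children:
    print(kind, repr(text))
```

The first child is a text node containing nothing but whitespace, so find(text=True, recursive=False) stops there and never reaches the node that holds 1865.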

Then process the result as needed. For example, strip each piece, drop the ones that are empty, and join the rest into the final text:

data = s.find('p').find_all(text=True, recursive=False)
# Skip whitespace-only pieces so the result has no leading space
text = ' '.join(i.strip() for i in data if i.strip())
print(text)
1865
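If this is needed in several places, the same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: direct_text is a hypothetical name that does not exist in BeautifulSoup, the 'html.parser' backend is used so the example runs without lxml, and it walks tag.children rather than calling find_all.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag, sep=' '):
    """Join the non-blank text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    pieces = []
    for child in tag.children:
        # Nested tags are not NavigableString instances, so they are skipped.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # ignore whitespace-only nodes
                pieces.append(stripped)
    return sep.join(pieces)


markup = u'<p>\n <strong>\n  Established\n </strong>\n 1865\n</p>\n'
soup = BeautifulSoup(markup, 'html.parser')
result = direct_text(soup.find('p'))
print(result)
```

Only the whitespace around each piece is stripped, so newlines elsewhere in the HTML file are left untouched.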

Mikhail M. answered Nov 21 '25 21:11


