I'm trying to get articles from wired.com. Generally, their article content looks like this:
<article itemprop="articleBody">
<p>Some text</p>
<p>Next text</p>
<p>...</p>
<p>...</p>
</article>
or like this:
<article itemprop="articleBody">
<div class="listicle-captions marg-t...">
<p></p>
</div>
</article>
So if the page is of type 1, I want the <p> and <h> tags to be extracted, while if the page is of type 2, I want to do something else. In other words, if the <p> and <h> tags are direct children of <article>, the page is type 1.
I tried the following code; it looks for <p> and <h> tags and prints out the tag names. The problem is that recursive="False" doesn't seem to help: when tested on a type 2 page, it still finds the tags even though it shouldn't (I expected to get a NoneType object).
import urllib.request
from bs4 import BeautifulSoup

articleUrl = "https://www.wired.com/2016/07/greatest-feats-inventions-100-years-boeing/"
soupArticle = BeautifulSoup(urllib.request.urlopen(articleUrl), "html.parser")
articleBody = soupArticle.find("article", {"itemprop": "articleBody"})
# should restrict the search to direct children of <article> (or so I expect)
articleContentTags = articleBody.findAll(["h1", "h2", "h3", "p"], recursive="False")
for tag in articleContentTags:
    print(tag.name)
    print(tag.parent.encode("utf-8"))
Why doesn't it work?
PS: Also, is there a difference between using findAll and findChildren, in general and in this particular case? The two look the same to me.
The string literal "False" is not the same as the boolean False; you need to actually pass recursive=False:

articleBody.find_all(["h1", "h2", "h3", "p"], recursive=False)

Any non-empty string is considered a truthy value; the only string you could pass that would behave like False would be the empty string, i.e. recursive="".
In [17]: bool("False")
Out[17]: True
In [18]: bool("foo")
Out[18]: True
In [19]: bool("")
Out[19]: False
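With the actual boolean in place, a minimal sketch of the type detection described in the question might look like the following. Here detect_article_type is a hypothetical helper name, and the HTML strings just mirror the two layouts shown in the question:

from bs4 import BeautifulSoup

type1_html = ('<article itemprop="articleBody">'
              '<p>Some text</p><p>Next text</p></article>')
type2_html = ('<article itemprop="articleBody">'
              '<div class="listicle-captions"><p>caption</p></div>'
              '</article>')

def detect_article_type(soup):
    """Return "type1" if <p>/<h*> tags are direct children of
    <article>, otherwise "type2"."""
    article = soup.find("article", {"itemprop": "articleBody"})
    # recursive=False (the boolean) restricts the search to direct children
    direct = article.find_all(["h1", "h2", "h3", "p"], recursive=False)
    return "type1" if direct else "type2"

print(detect_article_type(BeautifulSoup(type1_html, "html.parser")))  # type1
print(detect_article_type(BeautifulSoup(type2_html, "html.parser")))  # type2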
But stick to using the actual boolean False. Also, note that you will get an empty list/ResultSet back with recursive=False, not None, because you are calling find_all, not find.
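To see the difference between those two return values, here is a small check using the type 2 markup from the question:

from bs4 import BeautifulSoup

html = ('<article itemprop="articleBody">'
        '<div class="listicle-captions"><p>caption</p></div></article>')
article = BeautifulSoup(html, "html.parser").find("article")

print(article.find_all("p", recursive=False))  # [] -- an empty ResultSet
print(article.find("p", recursive=False))      # None -- nothing matched

As for the PS: as far as I know, findChildren is just an old BS2-era alias for find_all kept for backwards compatibility, so the two behave identically here.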