I wrote a script to capture the independence dates of a few countries from Wikipedia.
For example, for Kazakhstan:
import re
import requests
from bs4 import BeautifulSoup

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')
independ_date = None  # stays None if nothing is found
# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")
if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))
    if formation:
        # Try the lowercase spelling first, then the capitalised one
        independence = formation.find_next(text = re.compile("independence"))
        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))
            if independence:
                independ_date = independence.find_next("td").text
print(independ_date)
And I get the following output:
Almaty
This output is not located in the infobox but after it, in the article text. It's because "formation.find_next(text = re.compile("independence"))" found something outside of the infobox, but I don't understand why the search is not restricted to the infobox. How can I search only within that element?
Thank you in advance for your help!
It's because "formation.find_next(text = re.compile("independence"))" found something outside of the infobox
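Add .extract() to your soup.find() call to search only inside the "infobox geography vcard" element. find_next() walks forward through the whole document from the starting node, not just through that node's descendants, so it runs past the end of the table. extract() detaches the table from the page, so the search stops at the table's last element.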
infobox = soup.find("table", class_="infobox geography vcard").extract()
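To see the difference without hitting Wikipedia, here is a minimal sketch on a made-up page. The HTML snippet, the helper name independence_date and its confine flag are all assumptions for illustration, not part of the original script, and it uses the built-in "html.parser" instead of lxml so it has no extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page: an infobox followed by ordinary article content.
HTML = """
<html><body>
<table class="infobox geography vcard">
  <tr><th>Formation</th><td></td></tr>
  <tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>
"""


def independence_date(html, confine):
    """Return the text of the <td> after the independence label, or None.

    confine=True detaches the infobox with extract() before searching,
    so find_next() cannot walk into the rest of the page.
    """
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox geography vcard")
    if infobox is None:
        return None
    if confine:
        infobox.extract()  # removes the table from the tree and returns it

    formation = infobox.find_next(string=re.compile("Formation"))
    if formation is None:
        return None

    # Same two-step search as in the question: lowercase first, then capitalised.
    label = formation.find_next(string=re.compile("independence"))
    if label is None:
        label = formation.find_next(string=re.compile("Independence"))
    if label is None:
        return None

    cell = label.find_next("td")
    return cell.get_text(strip=True) if cell else None


print(independence_date(HTML, confine=False))  # search leaks into the article text
print(independence_date(HTML, confine=True))   # search stays inside the infobox
```

Without extract(), the lowercase search matches the paragraph after the table and the next "td" belongs to a later table. With extract(), the lowercase search finds nothing inside the infobox, so the capitalised fallback picks up the right row.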