Python & Beautiful Soup: Searching only in a certain class

I am writing a script to extract the independence dates of a few countries from Wikipedia.

For example, with Kazakhstan:

import re

import requests
from bs4 import BeautifulSoup

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')

# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")

independ_date = None  # stays None if nothing is found, so the final print cannot raise NameError

if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))

    if formation: 
        independence = formation.find_next(text = re.compile("independence")) 

        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))

            if independence:
                independ_date = independence.find_next("td").text


print(independ_date)

And I have the following output:

Almaty

This value does not come from the infobox but from the body text after it. That is because "formation.find_next(text = re.compile("independence"))" found a match outside the infobox, but I don't understand why the search is not limited to the infobox. How can I search only within it?

Thank you in advance for your help!

asked Nov 08 '22 13:11 by jGsch


1 Answer

It's because "formation.find_next(text = re.compile("independence"))" found something outside of the infobox

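find_next() walks forward through the whole document in parse order, so it does not stop at the end of the element you started from. Add .extract() to your soup.find() call: it detaches the table from the rest of the tree, so every later find_next() stays inside the "infobox geography vcard" element. Note that soup.find() returns None when no such table exists, in which case .extract() raises an AttributeError.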

infobox = soup.find("table", class_="infobox geography vcard").extract()
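To see the difference without hitting Wikipedia, here is a minimal sketch on a made-up HTML fragment. The fragment, the function names and the use of "html.parser" (instead of lxml, to avoid an extra dependency) are assumptions for illustration only; the real infobox markup is more complex. The first function mirrors the logic from the question, with and without .extract(). The second is an alternative that never leaves the table because it only looks at the table's own rows.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature page: an infobox followed by body text and another table.
HTML = """
<html><body>
<table class="infobox geography vcard">
  <tr><th>Formation</th><td></td></tr>
  <tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>
"""


def independence_date(html, detach):
    """Same search as in the question; `detach` toggles the .extract() call."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox geography vcard")
    if infobox is None:
        return None
    if detach:
        # Remove the table from the tree so find_next() ends with the table.
        infobox = infobox.extract()
    formation = infobox.find_next(string=re.compile("Formation"))
    if formation is None:
        return None
    # Lowercase first, then capitalised, as in the question.
    label = (formation.find_next(string=re.compile("independence"))
             or formation.find_next(string=re.compile("Independence")))
    if label is None:
        return None
    cell = label.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


def independence_date_scoped(html):
    """Alternative: only inspect rows that are descendants of the infobox."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox geography vcard")
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        header = row.find("th")
        if header is not None and re.search("independence", header.get_text(), re.I):
            cell = row.find("td")
            return cell.get_text(strip=True) if cell is not None else None
    return None


print(independence_date(HTML, detach=False))  # Almaty (leaks into the body text)
print(independence_date(HTML, detach=True))   # 16 December 1991
print(independence_date_scoped(HTML))         # 16 December 1991
```

Keep in mind that .extract() modifies the soup: the table is no longer part of the parsed page afterwards. If you still need the full document later, the row-based variant leaves it untouched.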

answered Nov 14 '22 21:11 by Nik Markin