I am trying to parse this webpage.
Each player page lists the ability stats. I am eventually trying to parse all abilities into an object, e.g. {'corners': 15, 'crossing': 15...}
I started by parsing a single stat, corners, like this:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://fmdataba.com/19/p/1165/lionel-messi/'
page = requests.get(url)
soup = bs(page.content, 'html.parser')
print(soup.prettify())
soup.find({"id": "fm_cro"})
but this returns an empty list.
Could anyone please help?

With bs4 4.7.1+ you can use nth-child(odd) and nth-child(even) to pick the name and value tds within each row and build an inner dict from them. Use :has and :contains to select the right table for each keyword, then collect each inner dict in an outer dict keyed by that keyword.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://fmdataba.com/19/p/1165/lionel-messi/', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
abilities = ['TECHNICAL', 'MENTAL' , 'PHYSICAL']
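def get_abilities(soup, keyword):
    # Find the table inside the div that directly follows the div whose h3 contains the keyword
    table = soup.select_one('div:has(h3:contains("' + keyword + '")) + div > table')
    # In each row the odd td holds the stat name and the even td holds its value
    d = {item.select_one('td:nth-child(odd)').text: int(item.select_one('td:nth-child(even)').text) for item in table.select('tr')}
    return d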
results = {}
for ability in abilities:
    results[ability] = get_abilities(soup, ability)
print(results)

CSS explanation:
The CSS selector line is as follows:
soup.select_one('div:has(h3:contains("' + ability + '")) + div > table')
select_one works like select in that it applies the given CSS selector to the soup object, but it returns only the first match.
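To make the difference between the two methods concrete, here is a minimal sketch. The HTML is a made-up two-row table standing in for one of the ability tables, not the real page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, standing in for one of the ability tables.
html = """
<table>
  <tr><td>Corners</td><td>15</td></tr>
  <tr><td>Crossing</td><td>15</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.select_one("tr")  # a single Tag, or None when nothing matches
all_rows = soup.select("tr")       # a list of Tags, empty when nothing matches

print(first_row.get_text(" ", strip=True))
print(len(all_rows))
```

Because select_one returns None when nothing matches, it is worth checking the result before calling methods on it when scraping a live page.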
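:has and :contains are pseudo-classes, like :nth-child(). Reading the selector for the first ability table from left to right:
div:has(h3:contains("TECHNICAL")) matches a div that contains an h3 whose text includes TECHNICAL.
+ div is the adjacent sibling combinator; it moves to the div that immediately follows that first div.
> table is the child combinator; it selects the table that is a direct child of that second div.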
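The selector can be tried offline against a small snippet. The sketch below assumes the layout the selector implies (a heading div immediately followed by a div holding the table); the markup, the stat values and the helper names table_for and to_dict are invented for illustration and are not taken from the real page. It uses :-soup-contains, the name that newer Soup Sieve releases prefer over the deprecated :contains.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed layout: a div with the h3 heading,
# immediately followed by a div that holds the table. Values are made up.
html = """
<div><h3>TECHNICAL</h3></div>
<div>
  <table>
    <tr><td>Corners</td><td>15</td></tr>
    <tr><td>Crossing</td><td>15</td></tr>
  </table>
</div>
<div><h3>MENTAL</h3></div>
<div>
  <table>
    <tr><td>Vision</td><td>20</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def table_for(soup, keyword):
    """Return the table in the div that directly follows the keyword's heading div."""
    selector = 'div:has(h3:-soup-contains("' + keyword + '")) + div > table'
    return soup.select_one(selector)


def to_dict(table):
    """Map each row's first (odd) td text to the int in its second (even) td."""
    return {
        row.select_one("td:nth-child(odd)").text: int(row.select_one("td:nth-child(even)").text)
        for row in table.select("tr")
    }


technical = to_dict(table_for(soup, "TECHNICAL"))
mental = to_dict(table_for(soup, "MENTAL"))
print(technical)
print(mental)
```

Each keyword resolves to its own table because the + combinator only looks at the div that comes right after the matching heading div.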

You can also use pandas:
import pandas as pd
import requests
url = 'https://fmdataba.com/19/p/1165/lionel-messi/'
page = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
tables = pd.read_html(page.text)
all_data = {}
for idx, name in [(2, 'TECHNICAL'), (3, 'MENTAL'), (4, 'PHYSICAL')]:
    tbl = tables[idx]
    data = {r[0]: r[1] for _, r in tbl.iterrows()}
    all_data[name] = data
tables[2] is the TECHNICAL table, tables[3] is the MENTAL table and tables[4] is the PHYSICAL table.
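The row-to-dict step can be checked without fetching the page. This is a minimal sketch that assumes each parsed table has the stat name in column 0 and the value in column 1; the DataFrame below is a hand-built stand-in with made-up rows, not output from read_html.

```python
import pandas as pd

# Hypothetical stand-in for one parsed table: column 0 = stat name, column 1 = value.
tbl = pd.DataFrame([["Corners", 15], ["Crossing", 15]])

# Same comprehension as in the loop above.
data = {r[0]: r[1] for _, r in tbl.iterrows()}

# An equivalent that pairs the two columns directly instead of iterating rows.
same = dict(zip(tbl[0], tbl[1]))

print(data)
```

The table positions (2, 3, 4) depend on the page layout at the time of scraping, so it may be worth printing each table's first rows to confirm the indices before relying on them.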