
Extracting data from list in Python, after BeautifulSoup scrape, and creating Pandas table

I've been learning the basics of Python for a short while and thought I'd try to put something together, but I've hit a stumbling block, despite searching widely to see where I might be going wrong.

I'm trying to grab a table, for example the one here: https://www.oddschecker.com/horse-racing/2020-09-10-chelmsford-city/20:30/winner

I realize that the table isn't laid out like a typical HTML table, so trying to grab it directly with Pandas wouldn't yield results. I therefore turned to BeautifulSoup to try to get at the data.

All the data I need seems to be in the rows with the class 'diff-row evTabRow bc', so I wrote the following:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.oddschecker.com/horse-racing/2020-09-10-haydock/14:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
table = soup.find_all("tr", class_="diff-row evTabRow bc")

This puts each horse, along with all of its corresponding data, into a list. From each entry I only need certain bits, e.g. "data-name" for the horse name and "data-odig" for the current odds.

I thought there might be some way to extract the data from the list to build a list of lists and then construct a data frame in Pandas, but I may be going about this all wrong.

user994319 asked Sep 09 '20 19:09



2 Answers

You can access any of the <tr> attributes through the .attrs property of the BeautifulSoup Tag object.

Once you have table, loop over each entry and pull out the attributes you want into a list of dicts. Then initialize a Pandas data frame from the resulting list.

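import pandas as pd

# Collect one dict per <tr> row, keeping only the attributes of interest
horse_attrs = []

for entry in table:
    attrs = dict(name=entry.attrs['data-bname'], dig=entry.attrs['data-best-dig'])
    horse_attrs.append(attrs)

# Each dict becomes a row; the dict keys become the column names
df = pd.DataFrame(horse_attrs)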

df
                name   dig
0         Las Farras  9999
1         Heat Miami  9999
2        Martin Beck  9999
3             Litran  9999
4      Ritmo Capanga  9999
5      Perfect Score  9999
6   Simplemente Tuyo  9999
7            Anpacai  9999
8          Colt Fast  9999
9         Cacharpari  9999
10        Don Leparc  9999
11   Curioso Seattle  9999
12       Golpe Final  9999
13       El Acosador  9999

Notes:

  • The url you provided didn't work for me, but this similar one did: https://www.oddschecker.com/horse-racing/palermo-arg/21:00/winner
  • I didn't see the exact attributes (data-name and data-odig) you mentioned, so I used ones with similar names. I don't know enough about horse racing to know if these are useful, but the method in this answer should allow you to choose any of the attributes that are available.
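If you are unsure which attributes a row actually carries, you can inspect .attrs before picking any. The following is only a minimal sketch on a made-up row: the HTML snippet, the attribute values and the data-bid attribute are invented for illustration and are not taken from the live page. It also uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical row, loosely modelled on the markup discussed above.
# The attribute values (and data-bid itself) are invented for illustration.
html = """
<table>
  <tr class="diff-row evTabRow bc" data-bname="Example Horse" data-best-dig="4.5" data-bid="123"></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", class_="diff-row evTabRow bc")

# .attrs is a plain dict, so its keys are the attribute names on the tag.
available = sorted(row.attrs)
print(available)

# Tag.get() returns None for a missing attribute instead of raising KeyError.
print(row.get("data-bname"))
print(row.get("data-name"))
```

Using .get() rather than indexing is a judgment call: it lets you probe for an attribute such as data-name without the loop crashing on rows that lack it.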
andrew_reece answered Oct 18 '22 06:10


The data you are looking for is in both the row tag <tr> and the cell tags <td>.

The issue is that not all of the <td> cells are useful, so you have to skip those.

import pandas as pd

from bs4 import BeautifulSoup
import requests

url   = requests.get('https://www.oddschecker.com/horse-racing/thirsk/13:00/winner')
soup  = BeautifulSoup(url.content, 'lxml')
rows = soup.find_all("tr", class_="diff-row evTabRow bc")

my_data = []
for row in rows:
    horse = row.attrs['data-bname']

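    # Visit only the <td> cells; iterating the row directly can also
    # yield text nodes, which have no .attrs
    for td in row.find_all('td', recursive=False):
        classes = td.get('class') or ['']
        if classes[0] != 'np':
            continue  # Skip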

        bookie = td['data-bk']
        odds   = td['data-odig']
        my_data.append(dict(
            horse  = horse,
            bookie = bookie,
            odds   = odds
        ))

df = pd.DataFrame(my_data)
print(df)

This will give you what you are looking for:

          horse bookie  odds
0    Just Frank     B3  3.75
1    Just Frank     SK  4.33
2    Just Frank     WH  4.33
3    Just Frank     EE  4.33
4    Just Frank     FB   4.2
..          ...    ...   ...
268     Tommy R     RZ    29
269     Tommy R     SX    26
270     Tommy R     BF  10.8
271     Tommy R     MK    41
272     Tommy R     MA    98

[273 rows x 3 columns]
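Note that BeautifulSoup returns attribute values as strings, so the odds column above holds text rather than numbers. Here is a rough sketch of the same row and cell loop run against a small inline snippet, followed by a numeric conversion. The markup and values are invented for illustration (the real page will differ), and it uses html.parser so it runs without lxml or network access.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample markup: each row has one non-price cell and two price cells.
html = """
<table>
<tr class="diff-row evTabRow bc" data-bname="Horse A">
  <td class="sel nm">Horse A</td>
  <td class="np o" data-bk="B3" data-odig="3.75"></td>
  <td class="np o" data-bk="SK" data-odig="4.33"></td>
</tr>
<tr class="diff-row evTabRow bc" data-bname="Horse B">
  <td class="sel nm">Horse B</td>
  <td class="np o" data-bk="B3" data-odig="29"></td>
  <td class="np o" data-bk="SK" data-odig="26"></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr", class_="diff-row evTabRow bc"):
    # recursive=False limits the search to the row's direct <td> children.
    for td in row.find_all("td", recursive=False):
        classes = td.get("class") or [""]
        if classes[0] != "np":
            continue  # same filter as in the answer: keep only 'np' cells
        records.append(
            {"horse": row["data-bname"], "bookie": td["data-bk"], "odds": td["data-odig"]}
        )

df = pd.DataFrame(records)

# Attribute values arrive as strings; convert so the column can be sorted or
# aggregated. errors="coerce" turns anything unparseable into NaN.
df["odds"] = pd.to_numeric(df["odds"], errors="coerce")
print(df.dtypes)
```

Whether to coerce bad values to NaN or let the conversion raise is up to you; coercing just keeps one odd cell from stopping the whole scrape.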
TKK answered Oct 18 '22 06:10