How to get all elements as rows for each href in HTML and add it to a pandas dataframe?

I am trying to fetch, as rows, the values associated with each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales

There should be one row per href element in the HTML, with one value for each of the provided header fields.

This is one of the portions of the HTML that I am trying to scrape:


  <tbody>
    
  <tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
  
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04

</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
  <td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
  </td>
  <td><span class="series">B</span>
  </td><td>03:20</td><td>
    <span class="">22.5</span></td><td>0</td>
    <td>22.5</td><td>0</td><td>0

    </td><td>3</td><td>67.20</td>
    <td>1</td><td>0</td><td>0</td></tr>
    <tr role="row" class="odd">
      <td class="sorting_1">
        <a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
      <td><span class="series">B</span></td><td>03:20</td><td>
        <span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
        <td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
        <td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
          <a href="/es/mercados/cotizacion/339083">AGUA</a></td>
          <td><span class="series">*</span>
          </td><td>03:20</td><td>
            <span class="color-1">29</span>
          </td><td>28.98</td><td>28.09</td>
            <td>29</td><td>28</td><td>296,871</td>
            <td>8,491,144.74</td><td>2,104</td><td>0.89</td>
            <td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
              <td>03:20</td>
              <td><span class="color-2">13.48</span>
              </td><td>13.46</td>
              <td>13.53</td><td>13.62</td><td>13.32</td>
              <td>2,706,398</td>
              <td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
              <td>-0.52</td>
            </tr><tr role="row" class="even"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
              </td><td>03:20</td><td><span class="color-2">10.65</span>
            </td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
            <td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
            <td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
            </td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
            <tr role="row" class="even"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
              <td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/1862">AMX</a></td>
              <td><span class="series">B</span></td><td>03:20</td>
              <td><span class="color-2">14.56</span></td><td>14.58</td>
              <td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
              <td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
              <td>-0.75</td></tr><tr role="row" class="even">
                <td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
              </td><td><span class="series">10</span></td><td>03:20</td><td>
                <span class="color-2">21.09</span>
              </td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
              <td>51,005</td><td>1,076,281.67</td>
              <td>22</td><td>-0.34</td><td>-1.59</td></tr>
      </tbody>
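On a static snippet like the one above, the rows and their hrefs can be pulled out with BeautifulSoup directly. A minimal sketch against a trimmed-down sample of that markup (note that, as the answer below explains, this does not help against the live site, because requests never receives this table):

```python
from bs4 import BeautifulSoup
import pandas as pd

# trimmed-down sample of the markup shown above (first row, first three cells)
html = """
<tbody>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC</a></td>
    <td><span class="series">*</span></td><td>03:20</td>
  </tr>
</tbody>
"""

soup = BeautifulSoup(html, 'html.parser')
records = []
for tr in soup.find_all('tr', {'role': 'row'}):
    a = tr.find('a')
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # one record per href element, carrying all cell values of that row
    records.append({'href': a['href'], 'cells': cells})

df = pd.DataFrame(records)
print(df.loc[0, 'href'])  # -> /es/mercados/cotizacion/1959
```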

And my current code results in an empty dataframe:

import pandas as pd
import requests
from bs4 import BeautifulSoup


# get the page from the webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
# yet it doesn't bring the expected rows!

print('Read html!')

# get headers from the table head
thead = soup.find("thead")
tr = thead.find_all("tr")

headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]

# create empty pandas dataframe with those columns
df = pd.DataFrame(columns=headers)

# fetch rows into the pandas dataframe
rows = soup.find_all('tr', {"role": "row"})

for row in rows:
    cells = row.findChildren('td')
    values = [cell.get_text(strip=True) for cell in cells]
    if len(values) == len(headers):
        # append the row to the dataframe
        df.loc[len(df)] = values

I would like to know if it is possible to build a pandas dataframe whose columns are the fields in the headers list and whose rows correspond to each href element.

For better perspective, the expected output should match the table at the bottom of the provided website, whose first row has the following schema:

EMISORA SERIE   HORA    ÚLTIMO   PPP    ANTERIOR    MÁXIMO  MÍNIMO VOLUMEN  IMPORTE OPS.    VAR PUNTOS  VAR %
AC        *    3:20    191.04   191.32  194.51     193.92   191.01  380,544  73,122,008.42   2,793  -3.19    -1.64

Is it possible to create such a dataset?

AlSub asked Oct 14 '25


1 Answer

As mentioned before, the table is loaded and rendered dynamically via JavaScript, something you cannot handle with requests because it just gets the static response and does not behave like a browser.

A solution that mimics a browser's behaviour is given by @thetaco using selenium, but you can also reach your goal with requests by using the source the data comes from.

  1. Find the request URL: use your browser's dev tools to inspect the network traffic. In this example it is: https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4

  2. Extract the string from the response (it is not valid JSON):

    requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4').text.split(';(', 1)[-1].split(')')[0]
    
  3. Convert the string into JSON (json.loads()) and transform it with pandas.json_normalize() into a dataframe. Your data is under the path ['response']['resultado']['A'].

  4. The column names may differ a bit because they are built from the keys of the JSON, but they can easily be mapped.

The response contains all content, including that of the other groups (ACCIONES, CKD'S, FIBRAS, TÍTULOS OPCIONALES), which can also be extracted; A, CKDS, F and TO are the abbreviations that can be used analogously for the selection.
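The unwrapping in steps 2 and 3 can be sketched offline against a minimal mock of the JSONP-style envelope the endpoint returns (the callback name and the sample figures below are invented for illustration; only the ;(...) wrapper shape matches the real response):

```python
import json
import pandas as pd

# minimal mock of the JSONP-style envelope; callback name and values are made up
payload = (
    'callback;({"response": {"resultado": {"A": ['
    '{"cveCorta": "AC", "cveSerie": "*", '
    '"datosEstadistica": {"hora": "03:20", "ppp": 191.32}}'
    ']}}})'
)

# step 2: cut away the wrapper to keep only the JSON body
raw = payload.split(';(', 1)[-1].split(')')[0]

# step 3: parse and flatten the nested records into columns
df = pd.json_normalize(json.loads(raw)['response']['resultado']['A'])
print(df.columns.tolist())
```

The nested datosEstadistica dict becomes dotted column names such as datosEstadistica.ppp, which is why the mapping in step 4 is needed.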

Example (all available information for ACCIONES from the XHR request)

import json
import requests
import pandas as pd

df = pd.json_normalize(
    json.loads(
        requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4')
        .text
        .split(';(', 1)[-1]
        .split(')')[0]
    )['response']['resultado']['A']
).dropna(axis=1, how='all')
idEmision idTpvalor cveSerie cveCorta idEmisora datosEstadistica.hora datosEstadistica.maximo datosEstadistica.minimo datosEstadistica.importeAcomulado datosEstadistica.noOperaciones datosEstadistica.variacionPuntos datosEstadistica.variacionPorcentual datosEstadistica.precioUltimoHecho datosEstadistica.ppp datosEstadistica.precioAnterior datosEstadistica.volumenOperado datosEstadistica.anioEjercicio datosEstadistica.insumosPu
0 1959 1 * AC 6081 03:20 192.98 189.01 9.54831e+07 3333 -2.59 -1.35 189.3 189.32 191.91 502297 0 0
1 203 1 B ACCELSA 5015 03:20 0 0 22.4 1 0 0 22.5 0 22.5 1 0 0
...
103 404833 1B 19 VMEX 34347 03:20 45.29 45.29 11007.9 8 0.14 0.31 45.29 0 45.15 243 0 0
104 327336 1 A VOLAR 30023 03:20 12.76 12.42 1.5744e+07 5006 0.24 1.93 12.67 12.68 12.44 1246397 0 0
105 5 1 * WALMEX 5214 03:20 70.37 67.83 1.21326e+09 19593 -2.02 -2.86 68.7 68.72 70.74 17639588 0 0

Coming closer to your expected result, you can post-process the dataframe to your needs:

import re
# exclude all columns referencing id information
df = df.loc[:, ~df.columns.str.startswith('id')]
# adjust the column names: drop the path and 'cve' prefixes, then split the camelCase
# (str.removeprefix() needs Python 3.9+; lstrip('cve') would strip characters, not the prefix)
df.columns = [re.sub(r"(?<=\w)([A-Z])", r" \1", c.split('.')[-1].removeprefix('cve')).upper() for c in df.columns]
df
SERIE CORTA HORA MAXIMO MINIMO IMPORTE ACOMULADO NO OPERACIONES VARIACION PUNTOS VARIACION PORCENTUAL PRECIO ULTIMO HECHO PPP PRECIO ANTERIOR VOLUMEN OPERADO ANIO EJERCICIO INSUMOS PU
* AC 03:20 191.17 187.8 1.14863e+08 4175 0.64 0.34 189.65 189.96 189.32 604632 0 0
B ACTINVR 03:20 15.03 15.03 36614.4 14 0 0 15.03 0 15.03 2436 0 0
...
A VOLAR 03:20 12.97 12.51 1.48613e+07 2832 0.07 0.55 12.83 12.75 12.68 1162684 0 0
* WALMEX 03:20 69.03 67.66 7.2698e+08 22462 -0.71 -1.03 68 68.01 68.72 10672270 0 0

Or simply map against the columns to get the exact column names:

map_dict = {'cveSerie':'SERIE', 'cveCorta':'EMISORA', 'datosEstadistica.hora':'HORA', 'datosEstadistica.maximo':'MÁXIMO',
       'datosEstadistica.minimo':'MÍNIMO', 'datosEstadistica.importeAcomulado':'IMPORTE', 'datosEstadistica.noOperaciones':'OPS.', 'datosEstadistica.variacionPuntos':'VAR PUNTOS',
       'datosEstadistica.variacionPorcentual':'VAR %', 'datosEstadistica.precioUltimoHecho':'ÚLTIMO', 'datosEstadistica.ppp':'PPP',
       'datosEstadistica.precioAnterior':'ANTERIOR', 'datosEstadistica.volumenOperado':'VOLUMEN'}
df.loc[:,[c for c in df.columns if c in map_dict.keys()]].rename(columns=map_dict)
SERIE EMISORA HORA MÁXIMO MÍNIMO IMPORTE OPS. VAR PUNTOS VAR % ÚLTIMO PPP ANTERIOR VOLUMEN
* AC 03:20 191.17 187.8 1.14863e+08 4175 0.64 0.34 189.65 189.96 189.32 604632

...
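Since the raw dataframe (before the id columns are dropped) still carries idEmision, you could also rebuild the href of each row, because the original table links every issuer to /es/mercados/cotizacion/<idEmision>. A minimal sketch on hand-made sample data (the two values are copied from the example output above):

```python
import pandas as pd

# sample of the raw (un-dropped) dataframe; values copied from the example output
df = pd.DataFrame({
    'idEmision': [1959, 203],
    'cveCorta': ['AC', 'ACCELSA'],
})

# rebuild the relative href each <a> tag pointed to
df['href'] = '/es/mercados/cotizacion/' + df['idEmision'].astype(str)
print(df['href'].tolist())
# -> ['/es/mercados/cotizacion/1959', '/es/mercados/cotizacion/203']
```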

HedgeHog answered Oct 16 '25


