I am trying to fetch, as rows, the different values for each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales
For each distinct href element in the HTML there should be one row, with one field per column in the provided headers.
This is one of the portions of the HTML that I am trying to scrape:
<tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
</td>
<td><span class="series">B</span>
</td><td>03:20</td><td>
<span class="">22.5</span></td><td>0</td>
<td>22.5</td><td>0</td><td>0
</td><td>3</td><td>67.20</td>
<td>1</td><td>0</td><td>0</td></tr>
<tr role="row" class="odd">
<td class="sorting_1">
<a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
<td><span class="series">B</span></td><td>03:20</td><td>
<span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
<td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
<td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/339083">AGUA</a></td>
<td><span class="series">*</span>
</td><td>03:20</td><td>
<span class="color-1">29</span>
</td><td>28.98</td><td>28.09</td>
<td>29</td><td>28</td><td>296,871</td>
<td>8,491,144.74</td><td>2,104</td><td>0.89</td>
<td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
<td>03:20</td>
<td><span class="color-2">13.48</span>
</td><td>13.46</td>
<td>13.53</td><td>13.62</td><td>13.32</td>
<td>2,706,398</td>
<td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
<td>-0.52</td>
</tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
</td><td>03:20</td><td><span class="color-2">10.65</span>
</td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
<td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
<td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
<tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
<td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1862">AMX</a></td>
<td><span class="series">B</span></td><td>03:20</td>
<td><span class="color-2">14.56</span></td><td>14.58</td>
<td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
<td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
<td>-0.75</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
</td><td><span class="series">10</span></td><td>03:20</td><td>
<span class="color-2">21.09</span>
</td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
<td>51,005</td><td>1,076,281.67</td>
<td>22</td><td>-0.34</td><td>-1.59</td></tr>
</tbody>
And my current code results in an empty dataframe:
# create empty pandas dataframe
import pandas as pd
import requests
from bs4 import BeautifulSoup
# get response code from webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
#print(soup.p.text)
# yet it doesn't bring the expected rows!
print('Read html!')
# get headers
thead = soup.find("thead")
tr = thead.find_all("tr")
headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
#print(headers)
df = pd.DataFrame(columns=headers)
# fetch rows into the pandas dataframe
# (you can find children with multiple tags by passing a list of strings)
rows = soup.find_all('tr', {"role": "row"})
#rows
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)
    # append row to the dataframe
I would like to know whether it is possible to get a pandas dataframe whose fields are the ones portrayed in the headers list and whose rows correspond to each href element.
For better perspective, the expected output should equal the table at the bottom of the provided website, whose first row has the following schema:
EMISORA SERIE HORA ÚLTIMO PPP ANTERIOR MÁXIMO MÍNIMO VOLUMEN IMPORTE OPS. VAR PUNTOS VAR %
AC * 3:20 191.04 191.32 194.51 193.92 191.01 380,544 73,122,008.42 2,793 -3.19 -1.64
Is it possible to create such a dataset?
As mentioned before, the table is loaded and rendered dynamically via JavaScript, which you cannot handle with requests
because it only gets the static response and does not behave like a browser.
A solution that mimics a browser's behaviour is given by @thetaco using selenium,
but you can also reach your goal with requests
by using the source the data comes from.
* Get the request URL: use your browser's dev tools to inspect the network traffic. In this example it is: https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4
* Extract the string from the response (it is not valid JSON):
requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4').text.split(';(', 1)[-1].split(')')[0]
* Convert the string into JSON (json.loads()) and transform it with pandas.json_normalize() into a dataframe. Your data is under the path ['response']['resultado']['A'].
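The string-extraction step can be sketched offline with a stand-in response body (the callback name below is made up; only the `;(...)` wrapper shape matters):

```python
import json

# Stand-in response body: the endpoint wraps its JSON in a JSONP-style
# callback, so the raw text is not directly parseable as JSON.
body = 'MainCallback;({"response": {"ok": true}})'

# Keep only the part between ';(' and the first closing ')'
payload = body.split(';(', 1)[-1].split(')')[0]

data = json.loads(payload)  # now a regular dict
```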
The column names may differ a bit because they are built from the keys in the JSON, but they can easily be mapped.
The response contains all the content, including that of the other groups (ACCIONES, CKD'S, FIBRAS, TÍTULOS OPCIONALES), which can be extracted analogously using the abbreviations A, CKDS, F and TO for the selection.
import json, requests
import pandas as pd

df = pd.json_normalize(
    json.loads(
        requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4')
        .text
        .split(';(', 1)[-1]
        .split(')')[0]
    )['response']['resultado']['A']
).dropna(axis=1, how='all')
idEmision | idTpvalor | cveSerie | cveCorta | idEmisora | datosEstadistica.hora | datosEstadistica.maximo | datosEstadistica.minimo | datosEstadistica.importeAcomulado | datosEstadistica.noOperaciones | datosEstadistica.variacionPuntos | datosEstadistica.variacionPorcentual | datosEstadistica.precioUltimoHecho | datosEstadistica.ppp | datosEstadistica.precioAnterior | datosEstadistica.volumenOperado | datosEstadistica.anioEjercicio | datosEstadistica.insumosPu | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959 | 1 | * | AC | 6081 | 03:20 | 192.98 | 189.01 | 9.54831e+07 | 3333 | -2.59 | -1.35 | 189.3 | 189.32 | 191.91 | 502297 | 0 | 0 |
1 | 203 | 1 | B | ACCELSA | 5015 | 03:20 | 0 | 0 | 22.4 | 1 | 0 | 0 | 22.5 | 0 | 22.5 | 1 | 0 | 0 |
... | ||||||||||||||||||
103 | 404833 | 1B | 19 | VMEX | 34347 | 03:20 | 45.29 | 45.29 | 11007.9 | 8 | 0.14 | 0.31 | 45.29 | 0 | 45.15 | 243 | 0 | 0 |
104 | 327336 | 1 | A | VOLAR | 30023 | 03:20 | 12.76 | 12.42 | 1.5744e+07 | 5006 | 0.24 | 1.93 | 12.67 | 12.68 | 12.44 | 1246397 | 0 | 0 |
105 | 5 | 1 | * | WALMEX | 5214 | 03:20 | 70.37 | 67.83 | 1.21326e+09 | 19593 | -2.02 | -2.86 | 68.7 | 68.72 | 70.74 | 17639588 | 0 | 0 |
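The other groups can be collected the same way; a minimal sketch, using a hypothetical payload with the same `['response']['resultado']` shape as the real endpoint (in practice you would fetch the text via requests as shown above, and the tickers here are only examples):

```python
import json
import pandas as pd

# Hypothetical JSONP-style payload mimicking the BmvJsonGeneric response shape
raw = ('cb;({"response": {"resultado": {'
       '"A":    [{"cveCorta": "AC",     "cveSerie": "*"}],'
       '"CKDS": [{"cveCorta": "ABCCK",  "cveSerie": "16"}],'
       '"F":    [{"cveCorta": "DANHOS", "cveSerie": "13"}],'
       '"TO":   []}}})')

data = json.loads(raw.split(';(', 1)[-1].split(')')[0])['response']['resultado']

# one frame per non-empty group, tagged with its abbreviation
frames = [pd.json_normalize(rows).assign(grupo=group)
          for group, rows in data.items() if rows]
df_all = pd.concat(frames, ignore_index=True)
```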
Coming closer to your result, you could post-process the dataframe to your needs:
import re
# exclude all columns referencing id information
df = df.loc[:, ~df.columns.str.startswith('id')]
# adjust the column names: drop the 'datosEstadistica.' path and the 'cve'
# prefix, then split camelCase into words
df.columns = [re.sub(r"(?<=\w)([A-Z])", r" \1", re.sub(r"^cve", "", c.split('.')[-1])).upper() for c in df.columns]
df
SERIE | CORTA | HORA | MAXIMO | MINIMO | IMPORTE ACOMULADO | NO OPERACIONES | VARIACION PUNTOS | VARIACION PORCENTUAL | PRECIO ULTIMO HECHO | PPP | PRECIO ANTERIOR | VOLUMEN OPERADO | ANIO EJERCICIO | INSUMOS PU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
* | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 | 0 | 0 |
B | ACTINVR | 03:20 | 15.03 | 15.03 | 36614.4 | 14 | 0 | 0 | 15.03 | 0 | 15.03 | 2436 | 0 | 0 |
... | ||||||||||||||
A | VOLAR | 03:20 | 12.97 | 12.51 | 1.48613e+07 | 2832 | 0.07 | 0.55 | 12.83 | 12.75 | 12.68 | 1162684 | 0 | 0 |
* | WALMEX | 03:20 | 69.03 | 67.66 | 7.2698e+08 | 22462 | -0.71 | -1.03 | 68 | 68.01 | 68.72 | 10672270 | 0 | 0 |
or, applied to the original column names, simply map against the columns to get the exact column names:
map_dict = {'cveSerie': 'SERIE', 'cveCorta': 'EMISORA', 'datosEstadistica.hora': 'HORA',
            'datosEstadistica.maximo': 'MÁXIMO', 'datosEstadistica.minimo': 'MÍNIMO',
            'datosEstadistica.importeAcomulado': 'IMPORTE', 'datosEstadistica.noOperaciones': 'OPS.',
            'datosEstadistica.variacionPuntos': 'VAR PUNTOS', 'datosEstadistica.variacionPorcentual': 'VAR %',
            'datosEstadistica.precioUltimoHecho': 'ÚLTIMO', 'datosEstadistica.ppp': 'PPP',
            'datosEstadistica.precioAnterior': 'ANTERIOR', 'datosEstadistica.volumenOperado': 'VOLUMEN'}
df.loc[:, [c for c in df.columns if c in map_dict]].rename(columns=map_dict)
SERIE | EMISORA | HORA | MÁXIMO | MÍNIMO | IMPORTE | OPS. | VAR PUNTOS | VAR % | ÚLTIMO | PPP | ANTERIOR | VOLUMEN |
---|---|---|---|---|---|---|---|---|---|---|---|---|
* | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 |
...
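Finally, to match the on-site column order exactly, you can reindex the renamed frame; a sketch with a single hand-filled row (values taken from the AC example in the question):

```python
import pandas as pd

# one example row using the mapped column names from above
df = pd.DataFrame([{
    'SERIE': '*', 'EMISORA': 'AC', 'HORA': '03:20', 'ÚLTIMO': 191.04,
    'PPP': 191.32, 'ANTERIOR': 194.51, 'MÁXIMO': 193.92, 'MÍNIMO': 191.01,
    'VOLUMEN': 380544, 'IMPORTE': 73122008.42, 'OPS.': 2793,
    'VAR PUNTOS': -3.19, 'VAR %': -1.64,
}])

# reorder the columns to the schema shown on the website
order = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO',
         'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
df = df[order]
```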