I am trying to fetch, as rows, the different values for each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales
For each distinct href element in the HTML there should be one row, with one field per column in the provided headers.
This is one of the portions of the HTML that I am trying to scrape:
<tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
</td>
<td><span class="series">B</span>
</td><td>03:20</td><td>
<span class="">22.5</span></td><td>0</td>
<td>22.5</td><td>0</td><td>0
</td><td>3</td><td>67.20</td>
<td>1</td><td>0</td><td>0</td></tr>
<tr role="row" class="odd">
<td class="sorting_1">
<a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
<td><span class="series">B</span></td><td>03:20</td><td>
<span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
<td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
<td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/339083">AGUA</a></td>
<td><span class="series">*</span>
</td><td>03:20</td><td>
<span class="color-1">29</span>
</td><td>28.98</td><td>28.09</td>
<td>29</td><td>28</td><td>296,871</td>
<td>8,491,144.74</td><td>2,104</td><td>0.89</td>
<td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
<td>03:20</td>
<td><span class="color-2">13.48</span>
</td><td>13.46</td>
<td>13.53</td><td>13.62</td><td>13.32</td>
<td>2,706,398</td>
<td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
<td>-0.52</td>
</tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
</td><td>03:20</td><td><span class="color-2">10.65</span>
</td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
<td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
<td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
<tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
<td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1862">AMX</a></td>
<td><span class="series">B</span></td><td>03:20</td>
<td><span class="color-2">14.56</span></td><td>14.58</td>
<td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
<td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
<td>-0.75</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
</td><td><span class="series">10</span></td><td>03:20</td><td>
<span class="color-2">21.09</span>
</td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
<td>51,005</td><td>1,076,281.67</td>
<td>22</td><td>-0.34</td><td>-1.59</td></tr>
</tbody>
And my current code results in an empty dataframe:
# create empty pandas dataframe
import pandas as pd
import requests
from bs4 import BeautifulSoup
# get response code from webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
#print(soup.p.text)
# yet it doesn't bring the expected rows!
print('Read html!')
# get headers
thead = soup.find("thead")
tr = thead.find_all("tr")
headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
#print(headers)
df = pd.DataFrame(columns=headers)
# fetch rows into the pandas dataframe
# (you can find children with multiple tags by passing a list of strings)
rows = soup.find_all('tr', {"role": "row"})
#rows
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)
    # append row to the dataframe
I would like to know whether it is possible to get a pandas dataframe whose fields are the ones portrayed in the headers list and whose rows correspond to each href element.
For better perspective, the expected output should equal the table at the bottom of the provided website, whose first row has the following schema:
EMISORA SERIE HORA ÚLTIMO PPP ANTERIOR MÁXIMO MÍNIMO VOLUMEN IMPORTE OPS. VAR PUNTOS VAR %
AC * 3:20 191.04 191.32 194.51 193.92 191.01 380,544 73,122,008.42 2,793 -3.19 -1.64
Is it possible to create such a dataset?
As mentioned before, the table is loaded and rendered dynamically via JavaScript, which you cannot handle with requests
because it only gets the static response and does not behave like a browser.
A solution that mimics a browser's behaviour is given by @thetaco using selenium,
but you can also reach your goal with requests
by using the source the data comes from.
* Get the request URL: use your browser's dev tools to inspect the network traffic. In this example it is: https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4
* Extract the string from the response (it is not valid JSON):
requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4').text.split(';(', 1)[-1].split(')')[0]
* Convert the string into JSON (json.loads()) and transform it with pandas.json_normalize() into a dataframe. Your data is under the path ['response']['resultado']['A'].
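The string-extraction step can be sketched offline with a stand-in response body (the callback name below is made up; only the `;(...)` wrapper shape matters):

```python
import json

# Stand-in response body: the endpoint wraps its JSON in a JSONP-style
# callback, so the raw text is not directly parseable as JSON.
body = 'MainCallback;({"response": {"ok": true}})'

# Keep only the part between ';(' and the first closing ')'
payload = body.split(';(', 1)[-1].split(')')[0]

data = json.loads(payload)  # now a regular dict
```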
The column names may differ a bit because they are built from the keys in the JSON, but they can easily be mapped.
The response contains all the content, including that of the other groups (ACCIONES, CKD'S, FIBRAS, TÍTULOS OPCIONALES), which can be extracted analogously using the abbreviations A, CKDS, F and TO for the selection.
import json, requests
import pandas as pd

df = pd.json_normalize(
    json.loads(
        requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4')
        .text
        .split(';(', 1)[-1]
        .split(')')[0]
    )['response']['resultado']['A']
).dropna(axis=1, how='all')
idEmision | idTpvalor | cveSerie | cveCorta | idEmisora | datosEstadistica.hora | datosEstadistica.maximo | datosEstadistica.minimo | datosEstadistica.importeAcomulado | datosEstadistica.noOperaciones | datosEstadistica.variacionPuntos | datosEstadistica.variacionPorcentual | datosEstadistica.precioUltimoHecho | datosEstadistica.ppp | datosEstadistica.precioAnterior | datosEstadistica.volumenOperado | datosEstadistica.anioEjercicio | datosEstadistica.insumosPu | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959 | 1 | * | AC | 6081 | 03:20 | 192.98 | 189.01 | 9.54831e+07 | 3333 | -2.59 | -1.35 | 189.3 | 189.32 | 191.91 | 502297 | 0 | 0 |
1 | 203 | 1 | B | ACCELSA | 5015 | 03:20 | 0 | 0 | 22.4 | 1 | 0 | 0 | 22.5 | 0 | 22.5 | 1 | 0 | 0 |
... | ||||||||||||||||||
103 | 404833 | 1B | 19 | VMEX | 34347 | 03:20 | 45.29 | 45.29 | 11007.9 | 8 | 0.14 | 0.31 | 45.29 | 0 | 45.15 | 243 | 0 | 0 |
104 | 327336 | 1 | A | VOLAR | 30023 | 03:20 | 12.76 | 12.42 | 1.5744e+07 | 5006 | 0.24 | 1.93 | 12.67 | 12.68 | 12.44 | 1246397 | 0 | 0 |
105 | 5 | 1 | * | WALMEX | 5214 | 03:20 | 70.37 | 67.83 | 1.21326e+09 | 19593 | -2.02 | -2.86 | 68.7 | 68.72 | 70.74 | 17639588 | 0 | 0 |
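The other groups can be collected the same way; a minimal sketch, using a hypothetical payload with the same `['response']['resultado']` shape as the real endpoint (in practice you would fetch the text via requests as shown above, and the tickers here are only examples):

```python
import json
import pandas as pd

# Hypothetical JSONP-style payload mimicking the BmvJsonGeneric response shape
raw = ('cb;({"response": {"resultado": {'
       '"A":    [{"cveCorta": "AC",     "cveSerie": "*"}],'
       '"CKDS": [{"cveCorta": "ABCCK",  "cveSerie": "16"}],'
       '"F":    [{"cveCorta": "DANHOS", "cveSerie": "13"}],'
       '"TO":   []}}})')

data = json.loads(raw.split(';(', 1)[-1].split(')')[0])['response']['resultado']

# one frame per non-empty group, tagged with its abbreviation
frames = [pd.json_normalize(rows).assign(grupo=group)
          for group, rows in data.items() if rows]
df_all = pd.concat(frames, ignore_index=True)
```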
Coming closer to your result, you could post-process the dataframe to your needs:
import re
# exclude all columns referencing id information
df = df.loc[:, ~df.columns.str.startswith('id')]
# adjust the column names: drop the 'datosEstadistica.' path and the 'cve'
# prefix, then split camelCase into words
df.columns = [re.sub(r"(?<=\w)([A-Z])", r" \1", re.sub(r"^cve", "", c.split('.')[-1])).upper() for c in df.columns]
df
SERIE | CORTA | HORA | MAXIMO | MINIMO | IMPORTE ACOMULADO | NO OPERACIONES | VARIACION PUNTOS | VARIACION PORCENTUAL | PRECIO ULTIMO HECHO | PPP | PRECIO ANTERIOR | VOLUMEN OPERADO | ANIO EJERCICIO | INSUMOS PU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
* | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 | 0 | 0 |
B | ACTINVR | 03:20 | 15.03 | 15.03 | 36614.4 | 14 | 0 | 0 | 15.03 | 0 | 15.03 | 2436 | 0 | 0 |
... | ||||||||||||||
A | VOLAR | 03:20 | 12.97 | 12.51 | 1.48613e+07 | 2832 | 0.07 | 0.55 | 12.83 | 12.75 | 12.68 | 1162684 | 0 | 0 |
* | WALMEX | 03:20 | 69.03 | 67.66 | 7.2698e+08 | 22462 | -0.71 | -1.03 | 68 | 68.01 | 68.72 | 10672270 | 0 | 0 |
or, applied to the original column names, simply map against the columns to get the exact column names:
map_dict = {'cveSerie': 'SERIE', 'cveCorta': 'EMISORA', 'datosEstadistica.hora': 'HORA',
            'datosEstadistica.maximo': 'MÁXIMO', 'datosEstadistica.minimo': 'MÍNIMO',
            'datosEstadistica.importeAcomulado': 'IMPORTE', 'datosEstadistica.noOperaciones': 'OPS.',
            'datosEstadistica.variacionPuntos': 'VAR PUNTOS', 'datosEstadistica.variacionPorcentual': 'VAR %',
            'datosEstadistica.precioUltimoHecho': 'ÚLTIMO', 'datosEstadistica.ppp': 'PPP',
            'datosEstadistica.precioAnterior': 'ANTERIOR', 'datosEstadistica.volumenOperado': 'VOLUMEN'}
df.loc[:, [c for c in df.columns if c in map_dict]].rename(columns=map_dict)
SERIE | EMISORA | HORA | MÁXIMO | MÍNIMO | IMPORTE | OPS. | VAR PUNTOS | VAR % | ÚLTIMO | PPP | ANTERIOR | VOLUMEN |
---|---|---|---|---|---|---|---|---|---|---|---|---|
* | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 |
...
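Finally, to match the on-site column order exactly, you can reindex the renamed frame; a sketch with a single hand-filled row (values taken from the AC example in the question):

```python
import pandas as pd

# one example row using the mapped column names from above
df = pd.DataFrame([{
    'SERIE': '*', 'EMISORA': 'AC', 'HORA': '03:20', 'ÚLTIMO': 191.04,
    'PPP': 191.32, 'ANTERIOR': 194.51, 'MÁXIMO': 193.92, 'MÍNIMO': 191.01,
    'VOLUMEN': 380544, 'IMPORTE': 73122008.42, 'OPS.': 2793,
    'VAR PUNTOS': -3.19, 'VAR %': -1.64,
}])

# reorder the columns to the schema shown on the website
order = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO',
         'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
df = df[order]
```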