Extract data from an embedded script tag in HTML

I'm trying to fetch data from inside a (large) script tag within an HTML page. Using BeautifulSoup I can get to the right script, but I cannot get the data I want out of it.

What I'm looking for inside this tag is a list whose first item is "Beleidsdekkingsgraad", namely ["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"]. More specifically, I need the last entry in that list (116,2).

Following the approaches in 1 or 2 did not get me there.

What I've done so far:

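import requests
from bs4 import BeautifulSoup

base = 'https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url = requests.get(base)
soup = BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
# Hard-coded script index and character offsets: this is the fragile part
all_scripts[3].get_text()[1907:2179]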

This is unsatisfactory, however, because the slice indices have to be changed every time new numbers are added.

What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to get the last number of the extracted list (i.e. 116,2).

Wokkel asked May 16 '26 03:05


1 Answer

You could use a regex to pull out the JavaScript object that holds that item, then parse it with the json library:

import json
import re

import requests

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
# Capture the object literal assigned to window.infographicData
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
# Pick the row that contains the label, then take its last entry
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
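
The path data['elements'][1]['data'][0] is tied to the chart's current layout. A minimal sketch of a lookup that does not depend on those indices, assuming the payload is ordinary nested JSON. The helper name find_row and the SAMPLE string are hypothetical, made up for illustration and not taken from the real page:

```python
import json
import re

# Hypothetical stand-in for the page source; the real payload is much larger
# and its exact structure is an assumption here.
SAMPLE = (
    '<script>window.infographicData={"elements":[{"type":"text"},'
    '{"data":[[["Maand","jan","feb"],'
    '["Beleidsdekkingsgraad","107,6","116,2"]]]}]};</script>'
)


def find_row(node, label):
    """Return the first list whose first item equals `label` (depth-first)."""
    if isinstance(node, list):
        if node and node[0] == label:
            return node
        children = node
    elif isinstance(node, dict):
        children = node.values()
    else:
        return None
    for child in children:
        found = find_row(child, label)
        if found is not None:
            return found
    return None


match = re.search(r'window\.infographicData=(.*);', SAMPLE)
data = json.loads(match.group(1))
row = find_row(data, "Beleidsdekkingsgraad")
print(row[-1])
```

On the real page, r.text would take the place of SAMPLE; find_row returns None if no such row exists, so that case should be checked before indexing.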

Or do the whole thing with a regex:

import re

import requests

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])

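How the second regex works: \["Beleidsdekkingsgraad" matches the start of the list, .+? lazily skips over the intermediate entries, and ,"([0-9,]+)"\] captures the final quoted number, which is the only one directly followed by the closing bracket.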
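
The pattern can also be checked offline on a small string before pointing it at the page. The SAMPLE text below is a made-up fragment with two rows, not the real page source:

```python
import re

# Made-up fragment; only the second row starts with the wanted label.
SAMPLE = (
    '[["Dekkingsgraad","105,0","115,9"],'
    '["Beleidsdekkingsgraad","107,6","109,1","116,2"]]'
)

pattern = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')

# findall returns the contents of the single capture group for each match.
last_entry = pattern.findall(SAMPLE)[0]
print(last_entry)  # 116,2
```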


Another option is to capture the whole list with a regex and parse it with json:

import json
import re

import requests

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
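
All three variants return the value as a string with a decimal comma ("116,2"). If a number is needed, a conversion along these lines should work, assuming the values never contain thousands separators. The helper name to_number is hypothetical:

```python
from decimal import Decimal


def to_number(text):
    """Convert a decimal-comma string such as "116,2" to a Decimal."""
    # Assumption: the comma is always the decimal separator.
    return Decimal(text.replace(",", "."))


print(to_number("116,2"))  # 116.2
```

Decimal keeps the value exactly as written; float(text.replace(",", ".")) is a simpler alternative if binary floating point is acceptable.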
QHarr answered May 18 '26 21:05


