I am writing up a literature review and am trying to use Python web scraping to collect abstracts and similar information about other research from the web.
For example, I'd like to extract the content of 'Transcript' from this webpage: https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3. I wrote the following Python code, but it doesn't work and extracts nothing:
from bs4 import BeautifulSoup
import requests
url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1417/rec/4"
html = requests.get(url,verify=False)
soup = BeautifulSoup(html.text,'html.parser')
item = soup.find('span', {'data-id': 'itemText'})
print(item)
Here is also a screenshot of the browser inspector; I want to extract the text paragraph.
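The data you're looking for is stored as JSON inside a <script> tag rather than as regular HTML elements in the page that requests downloads, so your BeautifulSoup search doesn't find it. You can use the re and json modules to parse it: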
import re
import json
import requests
url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3"
html_doc = requests.get(url).text
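# The page embeds its state as a JSON string literal passed to JSON.parse(...)
data = re.search(r"window\.__INITIAL_STATE__ = JSON\.parse\((.*)\);", html_doc)
# Decode twice: first the quoted string literal, then the JSON document inside it
data = json.loads(json.loads(data.group(1)))
print(data["item"]["item"]["text"])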
Prints:
This project will examine the economic impact of climate change, and climate change policy, on New Zealand households, families, and individuals. Price outputs and employment indices from Climate Change Commission models will be used with Treasury’s microsimulation model (TAWA) to model the impact on household incomes and expenditure due to different climate change mitigation pathways and policy settings.
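If you plan to run this over many records, a slightly more defensive variant may help. The following is only a minimal sketch: the helper name `extract_initial_state` and the sample HTML are made up for illustration, and it assumes the page keeps embedding its state as `window.__INITIAL_STATE__ = JSON.parse("...")` with a string literal that is also valid JSON. It raises a clear error instead of failing on `None` when the pattern is missing.

```python
import json
import re

# Assumption: the state is passed to JSON.parse as a double-quoted string literal.
# The inner group matches a quoted string, including backslash-escaped characters.
STATE_RE = re.compile(
    r'window\.__INITIAL_STATE__\s*=\s*JSON\.parse\(("(?:[^"\\]|\\.)*")\)'
)


def extract_initial_state(html_doc):
    """Return the embedded page state as a dict (hypothetical helper)."""
    match = STATE_RE.search(html_doc)
    if match is None:
        raise ValueError("window.__INITIAL_STATE__ not found; the page may have changed")
    # Decode twice: string literal -> JSON text -> Python dict
    return json.loads(json.loads(match.group(1)))


# Offline demo with synthetic HTML that mimics the page's structure
state = {"item": {"item": {"text": "Sample transcript"}}}
literal = json.dumps(json.dumps(state))
sample_html = (
    "<html><script>window.__INITIAL_STATE__ = JSON.parse("
    + literal
    + ");</script></html>"
)

print(extract_initial_state(sample_html)["item"]["item"]["text"])
```

For a real page, pass `requests.get(url).text` to the function in place of `sample_html`.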