
Why is my web scraping code not extracting any content?

I am writing up a literature review and trying to use Python web scraping to collect abstracts and similar information about other research from a website.

For example, I'd like to extract the content of 'Transcript' from this webpage: https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3. I wrote the Python code below, but it doesn't work and doesn't extract anything:

from bs4 import BeautifulSoup
import requests

url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1417/rec/4"
html = requests.get(url,verify=False)

soup = BeautifulSoup(html.text,'html.parser')
item = soup.find('span', {'data-id': 'itemText'})
print(item)

Here is also a screenshot of the browser inspector; I want to extract the text paragraph shown in it.

screenshot

tgallavich asked Sep 03 '25 17:09


1 Answer

The data you're looking for is stored as JSON inside a <script> tag, so the <span> you are searching for is not in the HTML that requests downloads and BeautifulSoup doesn't see it. You can use the re and json modules to parse it:

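import re
import json
import requests

url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3"
html_doc = requests.get(url).text

# The page state is embedded in a <script> tag as the argument to JSON.parse(...)
data = re.search(r"window\.__INITIAL_STATE__ = JSON\.parse\((.*)\);", html_doc)
# Decode twice: the argument is a JSON string literal that itself contains JSON
data = json.loads(json.loads(data.group(1)))

# The transcript text is stored under item -> item -> text
print(data["item"]["item"]["text"])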

Prints:

This project will examine the economic impact of climate change, and climate change policy, on New Zealand households, families, and individuals. Price outputs and employment indices from Climate Change Commission models will be used with Treasury’s microsimulation model (TAWA) to model the impact on household incomes and expenditure due to different climate change mitigation pathways and policy settings.
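
If you want to reuse this across many records, it may help to wrap the lookup in a function that returns None when the script is missing instead of raising an error. The following is only a minimal sketch under a few assumptions: the function name extract_initial_state and the miniature sample_html page are hypothetical and made up for illustration, and the pattern is essentially the one used above. It runs offline so you can check the double decoding without hitting the site:

```python
import json
import re

# Essentially the pattern from the answer: capture the argument of JSON.parse(...).
STATE_RE = re.compile(r"window\.__INITIAL_STATE__ = JSON\.parse\((.*)\);")


def extract_initial_state(html_doc):
    """Return the embedded page state as a dict, or None if it is not found."""
    match = STATE_RE.search(html_doc)
    if match is None:
        return None
    # The captured group is a JSON string literal whose content is itself JSON,
    # so it is decoded twice: first to a str, then to a dict.
    return json.loads(json.loads(match.group(1)))


# Hypothetical miniature page, built only to illustrate the structure.
_state = {"item": {"item": {"text": "Example transcript"}}}
sample_html = (
    "<html><body><div id='root'></div><script>"
    "window.__INITIAL_STATE__ = JSON.parse(" + json.dumps(json.dumps(_state)) + ");"
    "</script></body></html>"
)

state = extract_initial_state(sample_html)
text = state["item"]["item"]["text"]
print(text)

# A page without the script yields None rather than an AttributeError.
missing = extract_initial_state("<html><body>No state here</body></html>")
print(missing)
```

With a real page you would pass requests.get(url).text to the function and check for None before indexing into the result, since the key layout (item -> item -> text) is only known to hold for the record shown above.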
Andrej Kesely answered Sep 07 '25 04:09