Why is my web scraping code not extracting any content?

Question

I am writing up a literature review and trying to use Python web scraping to collect abstracts and similar information about other research from the web.

For example, I'd like to extract the content of 'Transcript' from this webpage: https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3. I wrote the Python code below, but it doesn't work and extracts nothing:

from bs4 import BeautifulSoup
import requests

url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1417/rec/4"
html = requests.get(url,verify=False)

soup = BeautifulSoup(html.text,'html.parser')
item = soup.find('span', {'data-id': 'itemText'})
print(item)

Here is also a screenshot of the browser's inspector; I want to extract the text paragraph shown there.

screenshot

Andrej Kesely · Accepted Answer

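The data you're looking for is not in the static HTML as a <span>. It is stored as JSON inside a <script> tag and rendered by JavaScript in the browser, so BeautifulSoup finds no matching element in the response from requests. You can use the re and json modules to parse it: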

import re
import json
import requests

url = "https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll17/id/1414/rec/3"
html_doc = requests.get(url).text

data = re.search(r"window\.__INITIAL_STATE__ = JSON\.parse\((.*)\);", html_doc)
data = json.loads(json.loads(data.group(1)))

print(data["item"]["item"]["text"])

Prints:

This project will examine the economic impact of climate change, and climate change policy, on New Zealand households, families, and individuals. Price outputs and employment indices from Climate Change Commission models will be used with Treasury’s microsimulation model (TAWA) to model the impact on household incomes and expenditure due to different climate change mitigation pathways and policy settings.
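The two nested json.loads calls are easy to misread, so here is a minimal offline sketch of the same idea. The argument to JSON.parse is a JavaScript string literal that itself contains JSON: the first json.loads turns the literal into JSON text, and the second turns that text into a dict. The function name extract_initial_state and the sample HTML are hypothetical stand-ins, not part of the real page, and the sketch assumes the page keeps the window.__INITIAL_STATE__ layout used in the answer.

```python
import json
import re

# Same pattern as in the answer: capture the argument passed to JSON.parse(...)
STATE_RE = re.compile(r"window\.__INITIAL_STATE__ = JSON\.parse\((.*)\);")


def extract_initial_state(html_doc):
    """Return the embedded page state as a dict, or None if it is not found.

    Hypothetical helper for illustration only.
    """
    match = STATE_RE.search(html_doc)
    if match is None:
        return None
    # First loads: JS string literal -> JSON text.
    # Second loads: JSON text -> Python dict.
    return json.loads(json.loads(match.group(1)))


# Hypothetical stand-in for the real page; json.dumps is applied twice so the
# escaping mirrors a JSON string passed to JSON.parse.
state = {"item": {"item": {"text": "Example transcript text."}}}
sample_html = (
    "<html><body><script>window.__INITIAL_STATE__ = JSON.parse("
    + json.dumps(json.dumps(state))
    + ");</script></body></html>"
)

result = extract_initial_state(sample_html)
print(result["item"]["item"]["text"])

# A page without the embedded state yields None instead of raising.
print(extract_initial_state("<html><body>no state here</body></html>"))
```

Returning None when the pattern is missing avoids the AttributeError that data.group(1) would raise if the site changes its markup.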

Tags: python, python-3.x, beautifulsoup

tgallavich
