
How to extract content from <script> using Beautiful Soup

I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):

<script>
...    
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...

I can identify the script I need with the following code:

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests 
import re
import json


page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")

soup = BeautifulSoup(page.content, 'html.parser')

all_scripts = soup.find_all('script')
all_scripts[0]

However, I'm at a loss for how to extract the values I want. (I'm very new to Python.) This thread recommended the following solution for a similar problem (edited to reflect the HTML I'm working with):

data = json.loads(all_scripts[0].get_text()[27:])

However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).

What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.

RJames asked Jan 17 '20 20:01


2 Answers

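The page stores its data as a JSON object assigned to window.initialState inside the <script> tag, so you can capture that object with a regular expression, parse it with the json module, and then read your values. For example: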

import re
import json
import requests

url = 'https://www.gofundme.com/f/eric-stevens-care-trust'

txt = requests.get(url).text

data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])

# print( json.dumps(data, indent=4) )  # <-- uncomment this to see all data

print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code     =', data['feed']['campaign']['location']['postal_code'])

Prints:

Campaign Hearts = 4817
Postal Code     = 90012
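
As a possible variation on the regex above, here is a minimal sketch that uses json.JSONDecoder.raw_decode instead. It also shows why the attempt in the question fails: json.loads raises "Expecting value ... (char 0)" whenever the string it receives does not begin with a JSON value, for example when it still starts with JavaScript. The script_text string and the parse_initial_state helper are made up for illustration; with Beautiful Soup you would pass in the text of the matching script tag instead.

```python
import json

# Hypothetical stand-in for the text of the <script> tag; the real one is far longer.
script_text = (
    'window.initialState = {"feed":{"campaign":{"campaign_hearts":4817,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"}}}};'
    'window.otherState = {};'
)

# Text that starts with JavaScript rather than JSON fails at character 0.
try:
    json.loads(script_text)
except json.JSONDecodeError as err:
    print("json.loads failed:", err)

MARKER = "window.initialState = "

def parse_initial_state(text):
    """Parse the JSON object that follows MARKER, ignoring anything after it."""
    start = text.find(MARKER)
    if start == -1:
        raise ValueError("marker not found in script text")
    # raw_decode reads exactly one JSON value starting at the given index
    # and returns it together with the index where that value ended.
    data, _end = json.JSONDecoder().raw_decode(text, start + len(MARKER))
    return data

data = parse_initial_state(script_text)
campaign = data["feed"]["campaign"]
print("Campaign Hearts =", campaign["campaign_hearts"])
print("Postal Code     =", campaign["location"]["postal_code"])
```

Because raw_decode stops at the end of the first complete JSON value, this does not rely on the first "};" in the page being the end of the object, which the non-greedy regex does.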
Andrej Kesely answered Oct 09 '22 17:10


You don't need a parsing library for this; plain string splitting is enough. Here is a simpler solution, though it relies on the keys appearing in exactly this order in the page source:

# Fetch the page source as text.

import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
html = requests.get(url).text  # GET, not POST, to retrieve the page

# Take the text between each key and the key that follows it in the source.

campaign_hearts = html.split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)

postal_code = html.split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
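
One caveat with splitting: it raises IndexError if a key is missing, and it breaks if a different key ever follows the one you want. A rough sketch of the same idea with the standard re module, anchored only on the key itself, might look like this. The page_text excerpt and the find_field helper are hypothetical and only mirror the snippet shown in the question.

```python
import re

# Hypothetical excerpt of the page source, shaped like the snippet in the question.
page_text = (
    '"campaign_hearts":4817,"social_share_total":11242,'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},'
    '"is_partner":false'
)

def find_field(pattern, text):
    """Return the first captured group for pattern, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# \d+ captures the digits after the key; [^"]* captures up to the closing quote.
campaign_hearts = find_field(r'"campaign_hearts":(\d+)', page_text)
postal_code = find_field(r'"postal_code":"([^"]*)"', page_text)

print(campaign_hearts)
print(postal_code)
```

Both values come back as strings here, so convert campaign_hearts with int() if you need a number.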
Amit Ghosh answered Oct 09 '22 18:10