Extracting text from script tag using BeautifulSoup in Python

Question

I am looking to extract email, phone and name value from the below code in SCRIPT tag(not in Body) using Beautiful soup(Python). I see Beautiful soup can be used for extracting.

I tried getting page using the following code -

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find(email:")

This Ajax request code is not repeating in the page again. Can we also write try and catch so that if it doesn't found it in the page, it won't throw any error.

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){
   
   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});

Once I get this, I also want to store in an excel file.

Thanks in anticipation.

alecxe · Accepted Answer

You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

Working example (based on what you've described in the question):

import re
from bs4 import BeautifulSoup

data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

soup = BeautifulSoup(data)
script = soup.find('script')

pattern = re.compile("(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']

Prints:

abc@g.com 9999999999 XYZ

I don't really like the solution, since that regex approach is really fragile. All sorts of things can happen that would break it. I still think there is a better solution and we are missing a bigger picture here. Providing a link to that specific site would help a lot, but it is what it is.

UPD (fixing the code OP provided):

soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$$document$\.ready"))

pattern = re.compile("(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']

prints:

abcd@gmail.com 9999999999 Shamita Shetty

Extracting text from script tag using BeautifulSoup in Python

Tags:

python

beautifulsoup

urllib2

Chopra

1 Answers

alecxe

Recent Activity

Donate For Us

Extracting text from script tag using BeautifulSoup in Python

Tags:

python

beautifulsoup

urllib2

Chopra

1 Answers

alecxe

Related questions

Recent Activity

Donate For Us