Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting text from script tag using BeautifulSoup in Python

I am looking to extract email, phone and name value from the below code in SCRIPT tag(not in Body) using Beautiful soup(Python). I see Beautiful soup can be used for extracting.

I tried getting page using the following code -

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find(email:")

This Ajax request code is not repeating in the page again. Can we also write try and catch so that if it doesn't found it in the page, it won't throw any error.

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){
   
   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: '[email protected]', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});

Once I get this, I also want to store in an excel file.

Thanks in anticipation.

like image 781
Chopra Avatar asked Aug 04 '14 04:08

Chopra


1 Answers

You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

Working example (based on what you've described in the question):

import re
from bs4 import BeautifulSoup

data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: '[email protected]',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

soup = BeautifulSoup(data)
script = soup.find('script')

pattern = re.compile("(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']

Prints:

[email protected] 9999999999 XYZ

I don't really like the solution, since that regex approach is really fragile. All sorts of things can happen that would break it. I still think there is a better solution and we are missing a bigger picture here. Providing a link to that specific site would help a lot, but it is what it is.


UPD (fixing the code OP provided):

soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$\(document\)\.ready"))

pattern = re.compile("(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']

prints:

[email protected] 9999999999 Shamita Shetty
like image 122
alecxe Avatar answered Nov 07 '22 07:11

alecxe