I want to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python). I understand Beautiful Soup can be used for this kind of extraction.
I tried fetching the page with the following code:
fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")
This Ajax request code appears only once in the page. Can we also add try/except handling so that no error is thrown if it isn't found in the page?
<script type="text/javascript" language='javascript'>
$(document).ready( function (){
$('#message').click(function(){
alert();
});
$('#addmessage').click(function(){
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: '[email protected]',
phone: '9999999999',
name: 'XYZ'
}
});
});
});
</script>
Once I have these values, I also want to store them in an Excel file.
Thanks in anticipation.
You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.
Working example (based on what you've described in the question):
import re
from bs4 import BeautifulSoup
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: '[email protected]',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
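# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')
# Capture key: 'value' pairs from the JavaScript object literal
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.text))
print(fields['email'], fields['phone'], fields['name'])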
Prints:
[email protected] 9999999999 XYZ
I don't really like the solution, since that regex approach is really fragile. All sorts of things can happen that would break it. I still think there is a better solution and we are missing a bigger picture here. Providing a link to that specific site would help a lot, but it is what it is.
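The question also asked how to avoid an error when the data is not on the page. A minimal sketch, assuming the same key: 'value' pattern as in the example above; `extract_fields` and `FIELD_PATTERN` are hypothetical names introduced here, not part of BeautifulSoup. Instead of wrapping the lookup in try/except, it returns None for anything it cannot find:

```python
import re

# Same key: 'value' pattern as in the example above
FIELD_PATTERN = re.compile(r"(\w+): '(.*?)'")


def extract_fields(script_text, wanted=("email", "phone", "name")):
    """Return a dict of the wanted fields; fields that are not found map to None."""
    if not script_text:
        # Covers the case where no script tag was found at all
        return dict.fromkeys(wanted)
    found = dict(FIELD_PATTERN.findall(script_text))
    # dict.get avoids the KeyError that fields['email'] would raise
    return {key: found.get(key) for key in wanted}
```

With the soup from the example it could be called as `script = soup.find('script')` followed by `fields = extract_fields(script.text if script else None)`.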
UPD (fixing the code OP provided):
soup = BeautifulSoup(data, 'html.parser')
# string= is the current name of the older text= argument (bs4 4.4.0+)
script = soup.html.find_next_sibling('script', string=re.compile(r"\$\(document\)\.ready"))
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.text))
print(fields['email'], fields['phone'], fields['name'])
prints:
[email protected] 9999999999 Shamita Shetty
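As for storing the result in an Excel file, one simple option is to write a CSV file, which Excel opens directly. This is only a sketch under assumptions: it targets Python 3, uses the standard `csv` module rather than a real .xlsx writer, and `save_fields` and the file name are hypothetical.

```python
import csv


def save_fields(path, rows, columns=("email", "phone", "name")):
    """Write a list of field dicts to a CSV file that Excel can open."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as handle:
        # extrasaction="ignore" drops keys such as 'url' that are not in columns
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

For a single page this could be called as `save_fields('contacts.csv', [fields])`; when scraping several pages, collect one dict per page and pass the whole list.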