I am trying to scrape various websites using Python. The following code works fine for me.
import urllib
import re
htmlfile =urllib.urlopen("http://google.com")
htmltext=htmlfile.read()
regex='<title>(.+?)</title>'
pattern=re.compile(regex)
title= re.findall(pattern,htmltext)
print title
To get the body content, I changed it as follows:
import urllib
import re
htmlfile =urllib.urlopen("http://google.com")
htmltext=htmlfile.read()
regex='<body>(.+?)</body>'
pattern=re.compile(regex)
title= re.findall(pattern,htmltext)
print title
The above code gives me an empty list ([]). I don't know what I am doing wrong. Please help.
Generally it's a bad idea to attempt to parse HTML with regular expressions.
The excellent Beautiful Soup library makes what you're trying to do trivial.
import bs4
html = '''
<head>
</head>
<body>
<div></div>
</body>
'''
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
print(bs4.BeautifulSoup(html, 'html.parser').find('body'))
Python also has an HTML parser in its standard library (html.parser in Python 3, HTMLParser in Python 2), which is a lower-level, less feature-rich alternative to Beautiful Soup.
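As a rough illustration of the standard-library route, here is a minimal Python 3 sketch that collects the text found inside the body element. The class name BodyTextExtractor and the sample markup are made up for this example, and it is only one possible way to use html.parser, not a drop-in replacement for Beautiful Soup.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: collect text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, and attributes on <body> are ignored here.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that occurs while inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


# Made-up sample markup; note the attribute on the body tag.
sample_html = """
<head><title>Example</title></head>
<body class="home">
<div>Hello</div>
<p>World</p>
</body>
"""

extractor = BodyTextExtractor()
extractor.feed(sample_html)
body_text = extractor.chunks
print(body_text)
```

Because the parser reports tags as events, an opening body tag with attributes is handled the same way as a bare one.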
If you still insist on using a regex, this should work. The re.DOTALL flag lets . match newlines, which is needed because the body spans several lines.
import re
print(re.findall('<body>(.*?)</body>', html, re.DOTALL))
Also, this may sound dumb, but make sure there are actually body tags in the htmltext string.
To answer the question: if you look through htmltext, you won't find a matching pair of plain <body> and </body> tags, so the pattern has nothing to match. I definitely recommend taking the Beautiful Soup route, as @rectangletangle mentions.
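One common reason the plain tags are missing is that the opening body tag carries attributes, so the literal text <body> never appears in the page. The sketch below uses made-up markup to show that case, plus a pattern that tolerates attributes. It assumes well-formed, non-nested markup and is still far more fragile than a real parser.

```python
import re

# Made-up sample markup: the opening body tag has an attribute.
sample_html = """<html><head><title>Example</title></head>
<body bgcolor="#fff">
<div>Hello</div>
</body></html>"""

# The literal pattern finds nothing, because "<body>" never occurs verbatim.
strict = re.findall(r"<body>(.*?)</body>", sample_html, re.DOTALL)

# [^>]* allows any attributes before the closing ">" of the opening tag.
# DOTALL lets "." cross newlines; IGNORECASE also accepts <BODY>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)
tolerant = BODY_RE.findall(sample_html)

print(strict)
print(tolerant)
```

Running a quick check such as '<body' in htmltext.lower() before applying any pattern is a cheap way to confirm what the page actually contains.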