Getting the content in the body of a webpage using Python

Question

I am trying to scan various websites using Python. The following code works fine for me.

# Python 2: urllib.urlopen and the print statement do not exist in Python 3.
import urllib
import re

htmlfile = urllib.urlopen("http://google.com")
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
title = re.findall(pattern, htmltext)
print title

To get the body content, I changed it as follows:

import urllib
import re

htmlfile = urllib.urlopen("http://google.com")
htmltext = htmlfile.read()
regex = '<body>(.+?)</body>'
pattern = re.compile(regex)
title = re.findall(pattern, htmltext)
print title

The code above prints an empty list ([]). I don't know what I am doing wrong. Please help.

rectangletangle · Accepted Answer

Generally it's a bad idea to attempt to parse HTML with regular expressions.

The excellent Beautiful Soup library makes what you're trying to do trivial.

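import bs4

html = '''
<head>
</head>
<body>
  <div></div>
</body>
'''

# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = bs4.BeautifulSoup(html, 'html.parser')
print(soup.find('body'))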

Python also has an HTML parser in its standard library (html.parser in Python 3, HTMLParser in Python 2). It is lower-level and less feature-rich than Beautiful Soup.
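For what it's worth, here is a minimal sketch of the standard-library route, assuming Python 3. The class name BodyTextExtractor, the helper body_text, and the sample page are made up for illustration. It only collects the text inside the body, not the markup.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text that appears between the body tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes on the body tag are ignored.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text found inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def body_text(html_text):
    """Return the body's text chunks joined by single spaces."""
    parser = BodyTextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page; note the attribute on the body tag.
sample = """
<html>
<head><title>Example</title></head>
<body bgcolor="#fff">
  <div>Hello</div>
  <p>world</p>
</body>
</html>
"""

print(body_text(sample))  # Hello world
```

Be aware that this sketch also picks up the contents of any script or style elements inside the body, which is one of the things Beautiful Soup makes easier to filter out.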

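If you still insist on using regex, this should work. The re.DOTALL flag is needed because, by default, . does not match newlines, and the body spans several lines.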

import re
print(re.findall('<body>(.*?)</body>', html, re.DOTALL))

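Also, this may sound obvious, but make sure there are actually body tags in the htmltext string. The pattern only matches a literal <body>, so an opening tag that carries attributes will not match.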
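To illustrate both failure modes, here is a small sketch comparing the pattern from the question with a more tolerant one. The sample markup is made up; a real page may differ, and a regex like this is still fragile compared to a proper parser.

```python
import re

# Hypothetical sample: the body tag carries an attribute and the content spans lines.
sample = """<html>
<head><title>Example</title></head>
<body bgcolor="#fff">
  <div>Hello</div>
</body>
</html>"""

# The pattern from the question: needs a bare <body> and content on a single line.
strict = re.compile(r"<body>(.+?)</body>")

# [^>]* tolerates attributes, re.DOTALL lets . match newlines,
# and re.IGNORECASE accepts <BODY> as well.
tolerant = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)

strict_matches = strict.findall(sample)
tolerant_matches = tolerant.findall(sample)

print(strict_matches)    # []
print(tolerant_matches)  # ['\n  <div>Hello</div>\n']
```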

ForgetfulFellow · Answer

To answer the question directly: if you look through htmltext, you won't find a matching pair of literal <body> and </body> tags, so the regex has nothing to match. That said, I definitely recommend taking the Beautiful Soup route, as @rectangletangle mentions.
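A quick way to check this yourself is to list whatever body tags the text really contains. This is only a sketch: find_body_tags is a hypothetical helper, and the htmltext value below is a made-up stand-in for the downloaded page.

```python
import re


def find_body_tags(html_text):
    """Return every opening or closing body tag exactly as it appears in the text."""
    return re.findall(r"</?body\b[^>]*>", html_text, re.IGNORECASE)


# Hypothetical stand-in for the downloaded htmltext.
htmltext = '<html><head></head><body bgcolor="#fff" onload="init()"><p>Hi</p></body></html>'

tags = find_body_tags(htmltext)
print(tags)                  # ['<body bgcolor="#fff" onload="init()">', '</body>']
print("<body>" in htmltext)  # False
```

If the first print shows an opening tag with attributes, or no tags at all, the original pattern cannot match.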

Tags:

python

web-scraping

Saurabh
