How can I retrieve the page title of a webpage (title html tag) using Python?

4 Answers

Here's a simplified version of @Vinko Vrsalovic's answer:

# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print(soup.title.string)

NOTE:

soup.title finds the first title element anywhere in the HTML document.
title.string assumes the element has only one child node, and that this child node is a string.

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup
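
On Python 3 the same idea might look like the sketch below. It assumes the bs4 package is installed; soup_title and fetch_title are hypothetical helper names, not part of BeautifulSoup. It uses get_text() instead of .string, so a title with more than one child node still yields text, and it returns None when the page has no title element.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def soup_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    # get_text() joins every text node inside <title>; .string would be None
    # if the element had more than one child
    return soup.title.get_text(strip=True)


def fetch_title(url):
    """Download url and return its page title (hypothetical helper)."""
    with urlopen(url) as response:
        return soup_title(response.read())
```

Calling fetch_title("https://www.google.com") would then print-ready return the title string for that page.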

answered Oct 23 '22 17:10

jfs

I always use lxml for such tasks. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

EDIT based on comment:

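# Python 2 (urllib2 became urllib.request in Python 3)
from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)  # fetch the page ourselves, then hand the file-like object to lxml
p = parse(page)
print(p.find(".//title").text)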
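
A rough Python 3 equivalent, assuming lxml is installed. The names lxml_title and fetch_title are hypothetical helpers for illustration. It uses findtext(), which returns None instead of raising when the document has no title element.

```python
from urllib.request import urlopen

import lxml.html


def lxml_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    doc = lxml.html.fromstring(html)
    # findtext() returns None when the XPath matches nothing, so a page
    # without <title> does not raise AttributeError
    return doc.findtext(".//title")


def fetch_title(url):
    """Download url and return its page title (hypothetical helper)."""
    with urlopen(url) as response:
        return lxml_title(response.read())
```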

answered Oct 23 '22 16:10

Peter Hoffmann

No need to import a parsing library. Fetch the page with requests and slice the title out of the response text:

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
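
Note that the slice above only works when the page contains a literal lowercase <title> tag with no attributes; if find() returns -1, the result is silently wrong. A sketch that still needs no third-party parser, using html.parser from the standard library, could look like this. TitleParser and extract_title are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case; attributes are ignored here
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Character references such as &amp; are already decoded at this point
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        """The stripped title text, or None if no complete <title> was seen."""
        if not self._done:
            return None
        return "".join(self._chunks).strip()


def extract_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

With requests, this would be called as extract_title(requests.get(url).text).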

answered Oct 23 '22 16:10

Rahul Chawla

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser

br = Browser()
br.open("http://www.google.com/")
print(br.title())

answered Oct 23 '22 16:10

codeape

Tags:

python

html

cschol
