I'm looking to write a Python script (using 3.4.3) that grabs a HTML page from a URL and can go through the DOM to try to find a specific element. I currently have this: <pre class="prettyprint"><code>#!/usr/bin/env python import urllib.request def getSite(url): return urllib.request.urlopen(url) if __name__ == '__main__': content = getSite('http://www.google.com').read() print(content) </code></pre> When I print content it does print out the entire html page which is something close to what I want... although I would ideally like to be able to navigate through the DOM rather then treating it as a giant string. I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

There are many different modules you could use. For example, lxml or BeautifulSoup. Here's an <code>lxml</code> example: <pre class="prettyprint"><code>import lxml.html mysite = urllib.request.urlopen('http://www.google.com').read() lxml_mysite = lxml.html.fromstring(mysite) description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description text = description.get('content') # content attribute of the tag >>> print(text) "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." </code></pre> And a <code>BeautifulSoup</code> example: <pre class="prettyprint"><code>from bs4 import BeautifulSoup mysite = urllib.request.urlopen('http://www.google.com').read() soup_mysite = BeautifulSoup(mysite) description = soup_mysite.find("meta", {"name": "description"}) # meta tag description text = description['content'] # text of content attribute >>> print(text) u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." </code></pre> Notice how <code>BeautifulSoup</code> returns a unicode string, while <code>lxml</code> does not. This can be useful/hurtful depending on what is needed.

Going through HTML DOM in Python

Tags:

python

html

dom

httprequest

I'm looking to write a Python script (using 3.4.3) that grabs a HTML page from a URL and can go through the DOM to try to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content it does print out the entire html page which is something close to what I want... although I would ideally like to be able to navigate through the DOM rather then treating it as a giant string.

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

213

asked Mar 12 '15 03:03

Jake Alsemgeest

2 Answers

There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite)

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

Notice how BeautifulSoup returns a unicode string, while lxml does not. This can be useful/hurtful depending on what is needed.

181

answered Nov 05 '22 05:11

Zach Gates

Check out the BeautifulSoup module.

from bs4 import BeautifulSoup
import urllib                                       
soup = BeautifulSoup(urllib.urlopen("http://google.com").read())

for link in soup.find_all('a'):
    print(link.get('href'))

answered Nov 05 '22 06:11

Boa

Related questions
                            
                                What is the difference between imregionalmax() of matlab and scipy.ndimage.filters.maximum_filter
                            
                                writing logs using crontab
                            
                                How to fix an encoding migrating Python subprocess to unicode_literals?
                            
                                Python - keeping counter inside list comprehension
                            
                                Find all nodes by attribute in XML using Python 2
                            
                                How do I handle recursion in a custom PyYAML constructor?
                            
                                How can a class that inherits from a NumPy array change its own values?
                            
                                how to merge two pandas DataFrames and aggregate one specific column
                            
                                How do I store version number strings in a django database so they are correctly comparable/sortable
                            
                                Python: find all sets of numbers inside a range where each set contains values that are x distance apart and don't exceed the range
                            
                                How to bring Tkinter window in front of other windows?
                            
                                Python Collections Counter for a List of Dictionaries
                            
                                IntegrityError: datatype mismatch in Python using praw
                            
                                Python shuffle list not working [duplicate]
                            
                                Issue with strftime (python)
                            
                                Error when "Cancel" while opening a file in PyQt4
                            
                                how to make a rowcount in ponyorm? Python
                            
                                QueryDict to string loses list within JSON
                            
                                How to unpack a tuple into more values than the tuple has?
                            
                                Image foveation in Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With