
Going through HTML DOM in Python

I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and walks the DOM to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content, it prints the entire HTML page, which is close to what I want. Ideally, though, I would like to navigate through the DOM rather than treat the page as one giant string.

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

Jake Alsemgeest asked Mar 12 '15 03:03



2 Answers

There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import urllib.request

import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

# xpath() returns a list of matching elements; take the first <meta name="description">
description = lxml_mysite.xpath("//meta[@name='description']")[0]
text = description.get('content')  # value of the tag's content attribute

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

import urllib.request

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Name the parser explicitly; recent versions of bs4 warn when none is given
soup_mysite = BeautifulSoup(mysite, 'html.parser')

description = soup_mysite.find("meta", {"name": "description"})  # meta tag description
text = description['content']  # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

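Note that the u prefix in the BeautifulSoup output is a Python 2 artifact: there, BeautifulSoup returns unicode strings, while lxml returns a plain str for ASCII-only values. On Python 3, both libraries return str, so the difference disappears.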
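
If installing a third-party module is not an option, the standard library's html.parser can do a similar lookup. It is worth being clear that this is a different technique: html.parser is event-based and does not build a DOM tree, so the sketch below only shows how to pick out one tag as the document streams past. The class name MetaDescriptionParser and the SAMPLE_HTML string are hypothetical, chosen so the example runs without a network request.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content attribute of the first <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names are lowercased.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "description":
            self.description = attributes.get("content")


# Hypothetical sample page, used here instead of a live request.
SAMPLE_HTML = """
<html>
  <head>
    <title>Example</title>
    <meta name="description" content="A small sample page.">
  </head>
  <body><p>Hello</p></body>
</html>
"""

parser = MetaDescriptionParser()
parser.feed(SAMPLE_HTML)
print(parser.description)
```

For anything beyond a single lookup like this, lxml or BeautifulSoup is probably the more comfortable choice, since both give you a tree you can navigate.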

Zach Gates answered Nov 05 '22 05:11


Check out the BeautifulSoup module.

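from bs4 import BeautifulSoup
import urllib.request

# Python 3: urlopen lives in urllib.request, not urllib
html = urllib.request.urlopen("http://google.com").read()
soup = BeautifulSoup(html, 'html.parser')

# Print the href of every <a> element (None if the attribute is missing)
for link in soup.find_all('a'):
    print(link.get('href'))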
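
One thing to keep in mind: link.get('href') returns the raw attribute value, so it may be a relative path, or None when the tag has no href. A minimal sketch of cleaning that up with urllib.parse.urljoin from the standard library follows. The helper name absolute_links and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, skipping missing or empty ones."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical href values, in the shape link.get('href') might return them.
sample_hrefs = ["/intl/en/about.html", "http://example.com/page", None, "images?hl=en"]

links = absolute_links("http://google.com/", sample_hrefs)
for link in links:
    print(link)
```

In the loop above, you could collect the href values into a list and pass them through a helper like this before following them.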

Boa answered Nov 05 '22 06:11