I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can walk the DOM to find a specific element.
I currently have this:
#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)
When I print content it does print out the entire HTML page, which is close to what I want... although I would ideally like to be able to navigate through the DOM rather than treating it as one giant string.
I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.
The DOM isn't a programming language; it's a programming interface, so it isn't limited to JavaScript and HTML. Here is a Python script that manipulates the DOM of an XML document.
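A minimal sketch with the standard-library xml.dom.minidom module (the XML snippet and tag names are illustrative, not from the original post):

from xml.dom.minidom import parseString

# Illustrative XML document; any well-formed XML string works here.
doc = parseString("<library><book title='Dune'/><book title='Hyperion'/></library>")

# Walk the DOM: visit every <book> element and read one of its attributes.
for book in doc.getElementsByTagName("book"):
    print(book.getAttribute("title"))

# Mutate the DOM: create a new element, attach it, and serialize the tree back out.
new_book = doc.createElement("book")
new_book.setAttribute("title", "Foundation")
doc.documentElement.appendChild(new_book)
print(doc.toxml())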
Using an HTML parsing library, we can search for the values of HTML tags and pull out specific data, such as the title of the page or the list of headers on the page. In general terms, the sequence of actions for getting information out of a web page looks like this: get the URL of the page you want to extract data from, download the HTML content of the page, then parse that HTML and pick out the data you need.
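As a minimal sketch of those three steps using only the standard library (Python 3 assumed; it just extracts the page title, and the URL is the one from the question):

import urllib.request
from html.parser import HTMLParser

# Steps 1 and 2: fetch the page and decode the raw bytes to text.
html = urllib.request.urlopen('http://www.google.com').read().decode('utf-8', errors='replace')

# Step 3: parse the HTML, collecting the text inside the <title> tag.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(html)
print(parser.title)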
There are many different modules you could use. For example, lxml or BeautifulSoup.
Here's an lxml example:
import lxml.html
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)
description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
And a BeautifulSoup example:
from bs4 import BeautifulSoup
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite)
description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute
>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
Notice how BeautifulSoup returns a Unicode string while lxml does not; the u prefix above is Python 2 output, and under Python 3 both examples hand you a str here. This can be useful or hurtful depending on what is needed.
Check out the BeautifulSoup module.
from bs4 import BeautifulSoup
import urllib.request

# urllib.urlopen() is Python 2; on Python 3 use urllib.request.urlopen().
soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
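To grab one specific element instead of every link, the same soup object can be queried directly; the id value below is purely illustrative:

print(soup.title.string)                    # text of the <title> tag
element = soup.find('div', id='main')       # first <div> with id="main", or None
if element is not None:
    print(element.get_text(strip=True))     # its text content with whitespace stripped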