Web Scraping data using python?

I just started learning web scraping with Python, and I've already run into some problems.

My goal is to scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna).

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

spans = soup.find_all(

From here, I don't know how to extract the species names. I've thought of using a regex (e.g. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tags...

Any input will be highly appreciated!

asked Mar 05 '12 07:03

user1248092

2 Answers

You might as well take advantage of the fact that all the scientific names (and only the scientific names) are in <i> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

BeautifulSoup and regular expressions are two different approaches to parsing a web page. The former exists so that you don't have to bother so much with the latter.

You should read up on what BeautifulSoup actually does; it seems like you're underestimating its utility.
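
To make the contrast concrete, here is a minimal sketch that runs both approaches over the same markup. The HTML is a hypothetical sample, not copied from fishbase.org, and the function names are made up for illustration. The parser collects every <i> element, whatever its attributes or line breaks, while a naive pattern silently misses two of the three names.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a results table (not taken from fishbase.org).
SAMPLE_HTML = """
<table>
  <tr><td><a href="#">Albacore tuna</a></td><td><i>Thunnus alalunga</i></td></tr>
  <tr><td><a href="#">Bigeye tuna</a></td><td><i class="sci">Thunnus obesus</i></td></tr>
  <tr><td><a href="#">Skipjack tuna</a></td><td><i>Katsuwonus
      pelamis</i></td></tr>
</table>
"""


def names_with_parser(html):
    """Collect the text of every <i> element inside the first table."""
    soup = BeautifulSoup(html, "html.parser")
    # " ".join(text.split()) collapses line breaks and repeated spaces.
    return [" ".join(tag.get_text().split()) for tag in soup.table.find_all("i")]


def names_with_regex(html):
    """Naive pattern: only matches a bare <i> tag whose text sits on one line."""
    return re.findall(r"<i>(.*?)</i>", html)


print(names_with_parser(SAMPLE_HTML))
print(names_with_regex(SAMPLE_HTML))
```

The regex could be patched to cope with attributes and newlines, but every such patch re-implements a little more of what the parser already does.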

answered Sep 26 '22 18:09

joe

What jozek suggests is the correct approach, but I couldn't get his snippet to work (perhaps because I am not running the BeautifulSoup 4 beta). What worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names
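
The snippet above targets Python 2 and BeautifulSoup 3. A rough Python 3 equivalent, assuming the bs4 package is installed, might look like the sketch below. The function names are hypothetical, and whether the page still puts its names in <i> tags inside the first table is an assumption carried over from the answers here, not something this sketch verifies.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

FISH_URL = "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna"


def extract_scientific_names(html):
    """Return the text of every <i> tag in the first table, or [] if there is no table."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.table is None:
        return []
    return [tag.get_text(strip=True) for tag in soup.table.find_all("i")]


def fetch_scientific_names(url=FISH_URL):
    """Download the page and extract the names (performs a live HTTP request)."""
    with urlopen(url, timeout=30) as response:
        return extract_scientific_names(response.read())
```

Keeping the parsing separate from the download means extract_scientific_names can be tried on a saved copy of the page; calling fetch_scientific_names() is the only step that touches the network.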

answered Sep 26 '22 18:09

BioGeek

Tags: python, html, beautifulsoup, web-scraping
