Web scraping data using Python?

I just started learning web scraping using Python. However, I've already run into some problems.

My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

spans = soup.find_all('span')

From here, I don't know how I would go about extracting the species names. I've thought of using a regex, e.g. soup.find_all("a", text=re.compile("\d+\s+\d+")), to capture the text inside the tags...

Any input will be highly appreciated!

asked Mar 05 '12 by user1248092



2 Answers

You might as well take advantage of the fact that all the scientific names (and only the scientific names) are in <i> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.

You should read up on what BS actually does; it seems like you're underestimating its utility.
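To see this selector at work without touching the network, here is a minimal sketch using BeautifulSoup 4 against a small HTML fragment standing in for the FishBase results page (the fragment and the species listed in it are illustrative; the real page's exact markup may differ, but it does italicize scientific names inside a table):

```python
from bs4 import BeautifulSoup

# A small stand-in for the FishBase results page: each row pairs a
# common name (in a link) with the scientific name in an <i> tag.
html = """
<table>
  <tr><td><a href="#">Tuna</a> <i>Thunnus thynnus</i></td></tr>
  <tr><td><a href="#">Skipjack tuna</a> <i>Katsuwonus pelamis</i></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <i> tag inside the first table.
scientific_names = [i.get_text() for i in soup.table.find_all("i")]
print(scientific_names)  # ['Thunnus thynnus', 'Katsuwonus pelamis']
```

No regex needed: find_all does the pattern matching on tag structure, which is exactly the kind of work BS is meant to save you from.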

answered Sep 26 '22 by joe

What jozek suggests is the correct approach, but I couldn't get his snippet to work (perhaps because I am not running the BeautifulSoup 4 beta). What worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names
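Both Python 2 and BeautifulSoup 3 are long unsupported. A sketch of the same answer ported to Python 3 and BeautifulSoup 4 (assuming the page still italicizes scientific names inside a table) could look like this, with the extraction wrapped in a small helper so it can be exercised without a network call:

```python
from bs4 import BeautifulSoup


def scientific_names_from(html):
    """Return the italicized scientific names from a results-page table.

    BeautifulSoup 4 renames BS3's findAll to find_all (the old
    camelCase name still works as an alias).
    """
    soup = BeautifulSoup(html, "html.parser")
    return [i.text for i in soup.table.find_all("i")]


# Against the live page (Python 3's urllib.request replaces urllib2):
#   import urllib.request
#   fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
#   print(scientific_names_from(urllib.request.urlopen(fish_url).read()))
```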
answered Sep 26 '22 by BioGeek