I just started learning web scraping using Python. However, I've already ran into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))
to capture the texts inside the tag...
Any input will be highly appreciated!
Most popular: Web scraping with PythonPython is regarded as the most commonly used programming language for web scraping. Incidentally, it is also the top programming language for 2021 according to IEEE Spectrum.
Python is one of the easiest ways to get started as it is an object-oriented language. Python's classes and objects are significantly easier to use than in any other language. Additionally, many libraries exist that make building a tool for web scraping in Python an absolute breeze.
Scrapy. Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page with the help of selectors based on XPath.
Python is free to use and distribute. Also, Python has a simple syntax that is easy to understand. Python can perform data analysis, web development, automation, scripting, software testing, prototyping, high-level data structures, data wrangling, and data scraping, among other tasks.
You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i/>
tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does, it seems like you're underestimating its utility.
What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With