When you use BeautifulSoup to scrape a certain part of a website, you can use soup.find() and soup.findAll(), or soup.select(). Is there a difference between the .find() and the .select() methods (e.g. in performance or flexibility), or are they the same?
find returns the first tag in the document that satisfies the condition (or None when there is no match); find_all scans the entire document and returns a list of every match.
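A minimal illustration of that difference, using a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None when nothing matches)
first = soup.find("li", class_="item")
print(first.text)                    # first

# find_all scans the whole document and returns every match as a list
items = soup.find_all("li", class_="item")
print([li.text for li in items])     # ['first', 'second']
```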
select_one does something similar using select:

    def select_one(self, selector):
        """Perform a CSS selection operation on the current element."""
        value = self.select(selector, limit=1)
        if value:
            return value[0]
        return None
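So for a single element, find and select_one are interchangeable; the only real difference is the query syntax. A tiny illustrative document (made up for this example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">hello</p><p class="intro">world</p>',
                     "html.parser")

# Both return the first matching tag; select_one just takes a CSS selector
first_find = soup.find("p", class_="intro")
first_select = soup.select_one("p.intro")
print(first_find.text, first_select.text)   # hello hello
```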
To summarise the comments:
soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a")
would look pretty ugly written as multiple chained find/find_all calls. On the other hand, only find can match by regex, e.g. find("a", href=re.compile(....)).
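As a rough sketch of what that chained version looks like (the markup below is hypothetical, shaped to match the selector above; note that find() searches all descendants, so recursive=False is needed to mimic the child combinator >):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selector in the text
html = ('<div id="foo"><div><div><div class="fee">'
        '<span><span><a href="/x">link</a></span></span>'
        '</div></div></div></div>')
soup = BeautifulSoup(html, "html.parser")

# The CSS one-liner
link = soup.select_one("div#foo > div > div > div.fee > span > span > a")

# The chained-find equivalent; recursive=False restricts each step to
# direct children, matching the ">" combinator
link2 = (soup.find("div", id="foo")
             .find("div", recursive=False)
             .find("div", recursive=False)
             .find("div", class_="fee", recursive=False)
             .find("span", recursive=False)
             .find("span", recursive=False)
             .find("a", recursive=False))

print(link.text, link2.text)   # link link
```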
But again, that is personal preference. For performance we can run some tests: I modified the code from an answer here, running it on 800+ html files taken from here. It is not exhaustive but should give a clue to the readability of some of the options and to the performance.
The modified functions are:
    from bs4 import BeautifulSoup
    from glob import iglob

    def parse_find(soup):
        author = soup.find("h4", class_="h12 talk-link__speaker").text
        title = soup.find("h4", class_="h9 m5").text
        date = soup.find("span", class_="meta__val").text.strip()
        soup.find("footer", class_="footer").find_previous("data", {
            "class": "talk-transcript__para__time"}).text.split(":")
        soup.find_all("span", class_="talk-transcript__fragment")

    def parse_select(soup):
        author = soup.select_one("h4.h12.talk-link__speaker").text
        title = soup.select_one("h4.h9.m5").text
        date = soup.select_one("span.meta__val").text.strip()
        soup.select_one("footer.footer").find_previous("data", {
            "class": "talk-transcript__para__time"}).text
        soup.select("span.talk-transcript__fragment")

    def test(patt, func):
        for html in iglob(patt):
            with open(html) as f:
                func(BeautifulSoup(f, "lxml"))
Now for the timings:
In [7]: from testing import test, parse_find, parse_select
In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop
In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop
Like I said, this is not exhaustive, but I think we can safely say the CSS selectors are definitely more efficient here.
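Outside IPython, the same kind of comparison can be scripted with the standard-library timeit module; a self-contained sketch on a small synthetic document (not the talk corpus used above):

```python
import timeit
from bs4 import BeautifulSoup

# Synthetic document standing in for the real corpus
html = '<div><span class="fee">hi</span><span>no</span></div>' * 200
soup = BeautifulSoup(html, "html.parser")

# Time each query style over the same parsed tree
t_find = timeit.timeit(lambda: soup.find_all("span", class_="fee"), number=100)
t_select = timeit.timeit(lambda: soup.select("span.fee"), number=100)
print(f"find_all: {t_find:.3f}s  select: {t_select:.3f}s")
```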