When you use BeautifulSoup to scrape a certain part of a website, you can use soup.find() and soup.findAll(), or soup.select(). Is there a difference between the .find() and the .select() methods (e.g. in performance or flexibility), or are they the same?
find returns the first tag in the document that satisfies the condition (or None when there is no match); find_all scans the entire document and returns a list of every match.
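A minimal illustration of that difference, using a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None when nothing matches)
first = soup.find("li", class_="item")
print(first.text)                    # first

# find_all scans the whole document and returns every match as a list
items = soup.find_all("li", class_="item")
print([li.text for li in items])     # ['first', 'second']
```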
select_one does something similar using select:

    def select_one(self, selector):
        """Perform a CSS selection operation on the current element."""
        value = self.select(selector, limit=1)
        if value:
            return value[0]
        return None
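So for a single element, find and select_one are interchangeable; the only real difference is the query syntax. A tiny illustrative document (made up for this example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">hello</p><p class="intro">world</p>',
                     "html.parser")

# Both return the first matching tag; select_one just takes a CSS selector
first_find = soup.find("p", class_="intro")
first_select = soup.select_one("p.intro")
print(first_find.text, first_select.text)   # hello hello
```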
To summarise the comments:
soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a")
would look pretty ugly written as multiple chained find/find_all calls. On the other hand, only find can match by regex, e.g. find("a", href=re.compile(....)).
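As a rough sketch of what that chained version looks like (the markup below is hypothetical, shaped to match the selector above; note that find() searches all descendants, so recursive=False is needed to mimic the child combinator >):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selector in the text
html = ('<div id="foo"><div><div><div class="fee">'
        '<span><span><a href="/x">link</a></span></span>'
        '</div></div></div></div>')
soup = BeautifulSoup(html, "html.parser")

# The CSS one-liner
link = soup.select_one("div#foo > div > div > div.fee > span > span > a")

# The chained-find equivalent; recursive=False restricts each step to
# direct children, matching the ">" combinator
link2 = (soup.find("div", id="foo")
             .find("div", recursive=False)
             .find("div", recursive=False)
             .find("div", class_="fee", recursive=False)
             .find("span", recursive=False)
             .find("span", recursive=False)
             .find("a", recursive=False))

print(link.text, link2.text)   # link link
```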
But again, that is personal preference. For performance we can run some tests: I modified the code from an answer here, running it on 800+ html files taken from here. It is not exhaustive but should give a clue to the readability of some of the options and to the performance.
The modified functions are:
    from bs4 import BeautifulSoup
    from glob import iglob

    def parse_find(soup):
        author = soup.find("h4", class_="h12 talk-link__speaker").text
        title = soup.find("h4", class_="h9 m5").text
        date = soup.find("span", class_="meta__val").text.strip()
        soup.find("footer", class_="footer").find_previous("data", {
            "class": "talk-transcript__para__time"}).text.split(":")
        soup.find_all("span", class_="talk-transcript__fragment")

    def parse_select(soup):
        author = soup.select_one("h4.h12.talk-link__speaker").text
        title = soup.select_one("h4.h9.m5").text
        date = soup.select_one("span.meta__val").text.strip()
        soup.select_one("footer.footer").find_previous("data", {
            "class": "talk-transcript__para__time"}).text
        soup.select("span.talk-transcript__fragment")

    def test(patt, func):
        for html in iglob(patt):
            with open(html) as f:
                func(BeautifulSoup(f, "lxml"))
Now for the timings:
In [7]: from testing import test, parse_find, parse_select
In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop
In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop
Like I said, this is not exhaustive, but I think we can safely say the CSS selectors are definitely more efficient here.
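Outside IPython, the same kind of comparison can be scripted with the standard-library timeit module; a self-contained sketch on a small synthetic document (not the talk corpus used above):

```python
import timeit
from bs4 import BeautifulSoup

# Synthetic document standing in for the real corpus
html = '<div><span class="fee">hi</span><span>no</span></div>' * 200
soup = BeautifulSoup(html, "html.parser")

# Time each query style over the same parsed tree
t_find = timeit.timeit(lambda: soup.find_all("span", class_="fee"), number=100)
t_select = timeit.timeit(lambda: soup.select("span.fee"), number=100)
print(f"find_all: {t_find:.3f}s  select: {t_select:.3f}s")
```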