Find the third occurring `<p>` tag with Beautiful Soup

As the title suggests, I'm trying to understand how to find the third occurring <p> on a web page (as an example, I used http://www.musicmeter.nl/album/31759).

Using the answer to this question, I tried the following code:

from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.musicmeter.nl/album/31759").text    # get HTML from http://www.musicmeter.nl/album/31759
soup = BeautifulSoup(html, 'html5lib')                              # Get data out of HTML

first_paragraph = soup.find('p')    # or just soup.p

print "first paragraph:", first_paragraph

second_paragraph = first_paragraph.find_next_siblings('p')

print "second paragraph:", second_paragraph

third_paragraph = second_paragraph.find_next_siblings('p')

print "third paragraph:", third_paragraph

But this code results in the following error for the third_paragraph:

Traceback (most recent call last):
  File "page_109.py", line 21, in <module>
    third_paragraph = second_paragraph.find_next_siblings('p')
AttributeError: 'ResultSet' object has no attribute 'find_next_siblings'

I tried to look up the error, but I couldn't figure out what is wrong.

Hunter asked Sep 12 '25 08:09


2 Answers

You are using find_next_siblings, i.e. the plural form, so you get a ResultSet (a list) back, and you cannot call .find_next_siblings on a list.

If you want each next paragraph in turn, use sibling, not siblings:

second_paragraph = first_paragraph.find_next_sibling('p')

print("second paragraph:", second_paragraph)

third_paragraph = second_paragraph.find_next_sibling('p')

Which can be chained:

third_paragraph = soup.find("p").find_next_sibling('p').find_next_sibling("p")
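
As a rough illustration of the singular/plural difference, here is a minimal sketch. The inline HTML is a made-up example (not the musicmeter page), and it uses the built-in html.parser instead of html5lib so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <p> tags that share one parent.
html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")

# Plural: a list-like ResultSet holding every following <p> sibling.
all_after = first.find_next_siblings("p")
print(len(all_after), [p.get_text() for p in all_after])

# Singular: a single Tag (or None), so the call can be repeated on the result.
second = first.find_next_sibling("p")
third = second.find_next_sibling("p")
print(second.get_text(), third.get_text())
```

Because the singular form returns None when there is no further sibling, a chained call will raise an AttributeError on pages with fewer paragraphs, so checking for None between steps is probably worthwhile.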

A much simpler way is to use nth-of-type:

print(soup.select_one("p:nth-of-type(3)"))

You should also be aware that finding the third occurring p is not the same as finding the second sibling of the first p on the page: if the first p does not have two sibling p tags, your logic will fail. Note that p:nth-of-type(3) also counts within a parent: it matches a p that is the third p child of its own parent, so it returns the third p on the page only when the first three p tags share a parent.

To really get the third occurring p using find logic, just use find_next:

third_paragraph = soup.find("p").find_next('p').find_next("p")
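
To make the difference concrete, here is a small sketch on made-up markup where the first p sits alone in its own div. The HTML and the variable names are assumptions for illustration only. It compares the sibling search, find_next (document order), and the nth-of-type selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has no <p> siblings of its own.
html = """
<div><p>one</p></div>
<div><p>two</p><p>three</p><p>four</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")

# Sibling logic: "one" has no <p> siblings, so the result is empty.
siblings = first.find_next_siblings("p")

# Document order: find_next walks forward through the whole tree.
third_in_page = first.find_next("p").find_next("p")

# CSS: the first <p> that is the third <p> child of its own parent.
third_of_type = soup.select_one("p:nth-of-type(3)")

print(len(siblings))               # 0
print(third_in_page.get_text())    # three
print(third_of_type.get_text())    # four
```

On this markup the three approaches disagree, so which one is right depends on whether you mean "third p in the page" or "third p inside some container".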

Or, if you want the first three, use find_all with limit set to 3:

soup.find_all("p", limit=3)
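
A minimal sketch of the find_all approach, again on made-up markup. find_all searches in document order regardless of nesting, and the length check is an added precaution for pages with fewer than three p tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with <p> tags at different nesting levels.
html = "<body><p>a</p><div><p>b</p></div><p>c</p><p>d</p></body>"
soup = BeautifulSoup(html, "html.parser")

# limit=3 stops the search once three matches have been collected.
first_three = soup.find_all("p", limit=3)

# Guard the index: the page may contain fewer than three <p> tags.
third = first_three[2] if len(first_three) == 3 else None

print([p.get_text() for p in first_three])
print(third.get_text() if third else "fewer than three <p> tags")
```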

Or, using your original logic to get the next two siblings:

first_paragraph = soup.find('p')    # or just soup.p

second, third = first_paragraph.find_next_siblings("p", limit=2)
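
One caveat with the two-name unpacking: it raises a ValueError if fewer than two siblings are found. A possible workaround, sketched here on made-up markup, is to pad the result with None before unpacking (the padding trick is my own suggestion, not part of Beautiful Soup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has only one <p> sibling.
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")
found = first.find_next_siblings("p", limit=2)

# Pad with None so the two-name unpacking always succeeds.
second, third = (list(found) + [None, None])[:2]

print(second.get_text() if second else None)
print(third)
```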

If you only want x tags, then only parse x tags. Just be sure you understand the difference between finding the third occurring <p> tag and the second sibling of the first p tag, as they may be different.

Padraic Cunningham answered Sep 13 '25 22:09


.find_next_siblings('p') returns a BeautifulSoup ResultSet, which behaves like a list in Python. Try the following code instead:

first_paragraph = soup.find('p')
siblings = first_paragraph.find_next_siblings('p')    # ResultSet, indexable like a list
print("second paragraph:", siblings[0])
print("third paragraph:", siblings[1])
wyattis answered Sep 13 '25 23:09