Beautiful Soup and extracting a div and its contents by ID

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

To find an element by its id:

div = soup.find(id="articlebody")
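
Once you have the div, getting at its contents could look like the sketch below. It assumes Beautiful Soup 4 (the `bs4` package) and the built-in "html.parser"; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the id "articlebody" comes from the question.
html = """
<html><body>
  <div id="header">Site header</div>
  <div id="articlebody"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
div = soup.find(id="articlebody")
if div is None:
    raise SystemExit("no element with id='articlebody'")

outer_html = str(div)                  # the div itself plus everything inside it
inner_html = div.decode_contents()     # only the markup inside the div
text = div.get_text(" ", strip=True)   # text only, tags removed

print(outer_html)
print(inner_html)
print(text)
```

Checking for None before using the result avoids an AttributeError when the id is not present in the document.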

Beautiful Soup 4 supports most CSS selectors through the .select() method, so you can use an id selector such as:

soup.select('#articlebody')

If you need to specify the element's type, you can add a type selector before the id selector:

soup.select('div#articlebody')

The .select() method returns a collection of elements, so it gives the same results as the following .find_all() calls:

soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
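
To see how these calls relate, here is a small side-by-side sketch. It assumes Beautiful Soup 4 with the built-in "html.parser"; the markup is a made-up example, and .select_one() is the single-element counterpart of .select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the target div nested inside another div.
html = (
    '<html><body><div><div id="articlebody">text</div></div>'
    '<p id="note">n</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_select = soup.select("div#articlebody")             # collection of matches
by_find_all = soup.find_all("div", id="articlebody")   # collection of matches
first = soup.find("div", id="articlebody")             # single Tag or None
first_css = soup.select_one("#articlebody")            # single Tag or None

print(len(by_select), len(by_find_all))
print(first is by_select[0], first_css is first)
print(soup.find(id="missing"))   # find() gives None when nothing matches
```

The collection-returning methods give an empty result when nothing matches, while .find() and .select_one() give None, so the two styles need different checks.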

I think there is a problem when the 'div' tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup is not able to find the "div" tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not nested so deeply.

The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

This is my code, where I just try to print the number of "div" tags with class "fcontent":

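from BeautifulSoup import BeautifulSoup

# Parse the saved page and count the <div> tags with class "fcontent".
with open('/Users/myUserName/Desktop/contacts.html') as f:
    soup = BeautifulSoup(f)
divs = soup.findAll('div', attrs={'class': 'fcontent'})  # avoid shadowing the built-in list
print len(divs)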

This is most likely a problem with the default Beautiful Soup parser. Switch to a different parser, such as 'lxml', and try again.
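
As a rough sketch of what switching parsers looks like, assuming Beautiful Soup 4 (the `bs4` package) rather than the older BeautifulSoup 3 module used in the question: the helper name make_soup, the parser order, and the sample markup are all hypothetical. 'lxml' and 'html5lib' are separate installs, so the helper falls back to the built-in 'html.parser' when they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns the soup and the name of the parser that was used.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Hypothetical stand-in for the saved page: deeply nested divs.
html = (
    '<div><div><div class="fcontent">a</div>'
    '<div class="fcontent">b</div></div></div>'
)

soup, parser_used = make_soup(html)
matches = soup.find_all("div", class_="fcontent")
print(parser_used, len(matches))
```

Different parsers can build different trees from malformed markup, so if one of them misses tags on a real page it is worth comparing the counts from each.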

In the Beautiful Soup source, this line allows divs to be nested within divs, so your concern in lukas' comment wouldn't be valid.

NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is specify the attrs you want, such as:

source.find('div', attrs={'id':'articlebody'})
