I'd like to extract the content Hello world. Please note that there are multiple <table> and similar <td colspan="2"> elements on the page as well:
<table border="0" cellspacing="2" width="800"> <tr> <td colspan="2"><b>Name: </b>Hello world</td> </tr> <tr> ...
I tried the following:
hello = soup.find(text='Name: ')
hello.findPreviousSiblings
But it returned nothing.
In addition, I'm also having a problem extracting My home address from the following:
<td><b>Address:</b></td> <td>My home address</td>
I'm using the same method to search for text="Address: ", but how do I navigate down to the next line and extract the content of that <td>?
The general workflow is to send an HTTP GET request to the URL of the page you want to scrape, which responds with HTML content (for example with Python's requests library), then parse that HTML with Beautiful Soup and keep the extracted data in a data structure such as a dict or list.
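A rough sketch of that workflow, in case it helps. The names fetch_html and parse_cells are made up for illustration, and urllib.request from the standard library is swapped in for requests here so the example has no extra dependency. The parsing step runs on a literal string so it works offline.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Send an HTTP GET request and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_cells(html):
    """Return the stripped text of every <td> in the document as a list."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]


# Parse a literal string so the example runs without network access;
# in practice the string would come from fetch_html(some_url).
sample = "<table><tr><td><b>Address:</b></td><td>My home address</td></tr></table>"
cells = parse_cells(sample)
print(cells)
```

A list is used only because it is the simplest container; a dict keyed by label would work just as well for label/value tables like this one.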
The contents attribute works well for extracting the text from <tag>text</tag>. It returns the tag's children as a list.
Example for <td>My home address</td>:
from bs4 import BeautifulSoup
s = '<td>My home address</td>'
soup = BeautifulSoup(s, 'html.parser')
td = soup.find('td')  # <td>My home address</td>
td.contents  # ['My home address']
Example for <td><b>Address:</b></td>:
from bs4 import BeautifulSoup
s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s, 'html.parser')
b = soup.find('td').find('b')  # <b>Address:</b>
b.contents  # ['Address:']
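For the second part of the question (getting from the Address: label to the cell after it), here is one possible approach, assuming Beautiful Soup 4 and that the label text is exactly "Address:" with no trailing space, as in the posted HTML. Searching for "Address: " with a trailing space would not match that markup. The wrapping <table><tr> is added only to make the snippet self-contained.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td><b>Address:</b></td> <td>My home address</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Match the label's text node exactly (no trailing space in this markup).
label = soup.find(string="Address:")

# find_next("td") walks forward in document order and returns the first
# <td> that starts after the label, skipping the whitespace in between.
value_cell = label.find_next("td")
address = value_cell.get_text(strip=True)
print(address)
```

The <td> that contains the label is not returned, because it starts before the label in document order.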
Use next instead:
>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'
next and previous let you move through the document elements in the order they were processed by the parser, while the sibling methods work with the parse tree. In Beautiful Soup 4 the same attributes are named next_element and previous_element.
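To make that difference concrete, here is a small sketch assuming Beautiful Soup 4 (so next_element and string= rather than next and text=), using the markup from the question with the tags closed:

```python
from bs4 import BeautifulSoup

html = '<table><tr><td colspan="2"><b>Name: </b>Hello world</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

name = soup.find(string="Name: ")

# Parse order: the next thing the parser saw after "Name: " is "Hello world".
by_parse_order = name.next_element

# Tree structure: "Name: " is the only child of <b>, so it has no siblings.
by_tree = name.next_sibling

# The sibling relationship holds one level up, between <b> and the text.
from_parent = name.parent.next_sibling

print(repr(by_parse_order), repr(by_tree), repr(from_parent))
```

This is also why the sibling lookup in the question came back empty: the matched text node sits alone inside <b>, and Hello world is a sibling of the <b> tag, not of the text inside it.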