Using BeautifulSoup to search HTML for string

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

When I used find_string = soup.body.findAll(text='Python'), find_string was [].

But when I used find_string = soup.body.findAll(text=re.compile('Python'), limit=1), find_string was [u'Python Jobs'], as expected.

What is the difference between these two statements that makes the second one work, even though the word being searched for appears more than once on the page?

kachilous asked Jan 20 '12 02:01




1 Answer

The following line is looking for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs')
[u'Python Jobs']

Note that a regexp anchored to match the whole string 'Python' also finds nothing:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

So your regexp matches any NavigableString that contains 'Python', whereas text='Python' only matches a NavigableString that is exactly 'Python'.
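The difference can be reproduced without BeautifulSoup. The following is only a rough sketch of the matching rule, not BeautifulSoup's actual implementation: it uses the standard library's html.parser to collect text nodes, and TextNodeCollector, find_text and the sample HTML are hypothetical names and data made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect each run of text between tags (a stand-in for NavigableString)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.nodes.append(data)


def find_text(nodes, pattern):
    """Assumed rule: a plain string must equal the whole node,
    while a compiled regexp only needs to be found somewhere in it."""
    if isinstance(pattern, re.Pattern):
        return [n for n in nodes if pattern.search(n)]
    return [n for n in nodes if n == pattern]


# Hypothetical sample markup; the real python.org page is far larger.
html = "<body><a href='/jobs'>Python Jobs</a><p>Welcome</p></body>"
collector = TextNodeCollector()
collector.feed(html)
collector.close()
nodes = collector.nodes

exact = find_text(nodes, "Python")                  # no node equals 'Python'
full = find_text(nodes, "Python Jobs")              # whole node matches
partial = find_text(nodes, re.compile("Python"))    # substring match
anchored = find_text(nodes, re.compile("^Python$")) # anchored, so no match

print(exact, full, partial, anchored)
```

With this sample, only the full string 'Python Jobs' and the unanchored regexp find the link text, which mirrors the four findAll results shown above.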

sgallen answered Oct 10 '22 08:10