 

Python regular expression for HTML parsing (BeautifulSoup)

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows this format:

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide an example in Python to parse the HTML for the value?

mshafrir asked Sep 10 '08 21:09




1 Answer

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. Since you already know which regexp to use, I'm just contributing the BeautifulSoup example :-)

from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "from BeautifulSoup import ..." is Python 2 only

# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag. Attribute filters go in attrs, because the keyword
# name= is already taken by find() for the tag name itself
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # Index the attribute directly by its name
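
For comparison, here is a minimal sketch of the regex route the question asks about. It assumes the tag always appears exactly in the format given in the question (same attribute order, double quotes, self-closing slash); the sample string and the names `pattern` and `foo_id` are only illustrative. If the markup ever deviates from that format, the pattern simply will not match, which is the fragility that makes a parser the safer choice.

```python
import re

# Sample input in the format from the question (illustrative data).
html_data = '<form><input type="hidden" name="fooId" value="12-3456789-1111111111" /></form>'

# Assumption: attributes appear in exactly this order, with double quotes.
# The capture group grabs everything up to the closing quote of value.
pattern = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)"\s*/>')

match = pattern.search(html_data)
foo_id = match.group(1) if match else None  # None when the tag is not found
print(foo_id)  # prints 12-3456789-1111111111
```

Checking `match` before calling `group(1)` avoids an AttributeError when the page does not contain the field.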
Vinko Vrsalovic answered Sep 20 '22 08:09