
How to extract a URL GET parameter from <a> tags in the full HTML text

Tags:

python

html

regex

So I have an HTML page. It's full of various tags, and most of them have a sessionid GET parameter in their href attribute. Example:

...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...

So, as you can see, the sessionid is the same everywhere. I just need to get its value into a variable, no matter which link it comes from: x=11692390. I'm a newbie at regex, and Google wasn't helpful. Thanks a lot!

creitve asked Aug 17 '10 09:08



1 Answer

This does not use regexes, but here is what you would do in Python 2.6 with BeautifulSoup and urlparse:

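from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
import urlparse

# html is assumed to already hold the page source as a string
soup = BeautifulSoup(html)
# keep only the <a> tags that actually have an href attribute
links = soup.findAll('a', href=True)

for link in links:
    href = link['href']
    # split the URL, then parse its query string into a dict of lists
    url = urlparse.urlparse(href)
    params = urlparse.parse_qs(url.query)
    if 'sessionid' in params:
        print params['sessionid'][0]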
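
If you are on Python 3, the same idea can be sketched with only the standard library. This is a rough port, not the code above: it swaps BeautifulSoup for html.parser and uses urllib.parse instead of the old urlparse module. The names SessionIdFinder, find_sessionid and find_sessionid_regex are made up for illustration, and the sample string is a shortened copy of the links from the question. The regex variant is included because the question asked for one; it assumes the value is always digits and simply scans the raw text, so it is more of a shortcut than a robust parser.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class SessionIdFinder(HTMLParser):
    """Collects the sessionid value from every <a href=...> it sees."""

    def __init__(self):
        super().__init__()
        self.session_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Parse the query string of the link into a dict of lists.
        params = parse_qs(urlparse(href).query)
        if "sessionid" in params:
            self.session_ids.append(params["sessionid"][0])


def find_sessionid(html):
    """Return the first sessionid found in any link, or None."""
    finder = SessionIdFinder()
    finder.feed(html)
    return finder.session_ids[0] if finder.session_ids else None


def find_sessionid_regex(html):
    """Regex shortcut: first run of digits after 'sessionid=' in the text."""
    match = re.search(r"\bsessionid=(\d+)", html)
    return match.group(1) if match else None


# Shortened sample taken from the links in the question.
sample = '''
<a href="struct_view_distrib.asp?sessionid=11692390">
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
'''

x = find_sessionid(sample)
print(x)  # prints 11692390
```

Since the sessionid is the same in every link, taking the first match is enough. Both functions return the value as a string, so wrap it in int() if you need a number.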
Constantin answered Sep 19 '22 15:09
