
Python: collecting links / script values from a page

I am trying to write a program that collects links and some values from a website. It mostly works, but I have come across a page on which it fails.

With Firebug I can see that this is the HTML code of the elusive "link" (I can't find it when viewing the page's source, though):

<a class="visit" href="/tet?id=12&mv=13&san=221">

    221

</a>

and this is the script:

<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>

I'm wondering how to get the "link" ("/tet?id=12&mv=13&san=221") from the HTML code, and the string "221" from either the script or the HTML, using selenium, mechanize or requests (or some other library).

I made an unsuccessful attempt with mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.

Extra info, which might be important: to get to the page I have to click on a button with this code:

<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">

    load2

</a>

after which a "new page" loads in a part of the window (but the url never changes)

user3053161 asked Feb 17 '26 23:02


1 Answer

I think you pasted the wrong script of yours ;)

I'm not sure what you need exactly - there are at least two different approaches.

  • Matching all hrefs using a regex
  • Matching specific tags and using get_attribute(...)

For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and then apply something like the following regex (in a Python string literal you will have to escape either the single or the double quotes!):

<a.+?href=['"](.*?)['"].*?/?>
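A minimal sketch of the regex approach, assuming the page source is already available as a string (for example from webdriver.page_source). The helper name extract_hrefs and the unescaping step are illustrative assumptions, not part of the original answer; the unescaping is there because the script fragment in the question writes quotes and slashes as \" and \/, which the pattern would otherwise not match:

```python
import re

# The regex from the answer. A raw triple-quoted string avoids having to
# escape either kind of quote inside the pattern.
HREF_RE = re.compile(r"""<a.+?href=['"](.*?)['"].*?/?>""")


def extract_hrefs(page_source):
    """Return the href value of every anchor tag found in page_source.

    Hypothetical helper for illustration only.
    """
    # Markup embedded in a script is often escaped (\" and \/).
    # Undo that first so the pattern sees plain quotes and slashes.
    unescaped = page_source.replace('\\"', '"').replace('\\/', '/')
    return HREF_RE.findall(unescaped)


# The two fragments from the question: rendered HTML and escaped script text.
html = '<a class="visit" href="/tet?id=12&mv=13&san=221">\n    221\n</a>'
script = r'<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>'

print(extract_hrefs(html))
print(extract_hrefs(script))
```

Note that a regex like this is only a rough tool: it matches every anchor on the page, not just those with class "visit", and it will not cope with tags split across several lines.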

For the second one, if you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of web elements and iterate through them to get their attributes. In newer Selenium 4 releases the equivalent call is webdriver.find_elements(By.CSS_SELECTOR, '.visit').

This could result in code like this:

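# `webdriver` is assumed to be an already-created driver instance.
# Newer Selenium 4 releases removed find_elements_by_css_selector; there,
# use webdriver.find_elements(By.CSS_SELECTOR, '.visit') instead.
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')

for element in elements:
    # The Python binding spells this get_attribute, not getAttribute.
    hrefs.append(element.get_attribute('href'))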

Or as a one-liner using a list comprehension:

hrefs = [element.get_attribute('href')
         for element in webdriver.find_elements_by_css_selector('.visit')]
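The question also asks for the link text ("221"). With Selenium, each element's text property should give that. As an alternative that needs no browser, here is a rough sketch using Python's standard html.parser module on HTML that is already in a string (for example from webdriver.page_source). The class name VisitLinkParser is made up for this example, and this is a swapped-in technique, not what the answer above describes:

```python
from html.parser import HTMLParser


class VisitLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose class list contains 'visit'.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._inside = False   # True while reading inside a matching <a>
        self._href = None      # href of the tag currently being read
        self._text = []        # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "visit" in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Strip the whitespace and newlines around the link text.
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


# The HTML fragment from the question.
html = '<a class="visit" href="/tet?id=12&mv=13&san=221">\n    221\n</a>'

parser = VisitLinkParser()
parser.feed(html)
parser.close()
print(parser.links)
```

This only sees whatever markup is in the string it is fed, so for content that is loaded after clicking the button, the source has to be captured after that content has appeared.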
stuXnet answered Feb 19 '26 14:02


