python, collecting links / script values from page

Question

I am trying to write a program that collects links and some values from a website. It mostly works, but I have come across a page where it does not.

With Firebug I can see that this is the HTML of the elusive "link" (I can't find it when viewing the page's source, though):

<a class="visit" href="/tet?id=12&mv=13&san=221">

    221

</a>

and this is the script:

<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>

I'm wondering how to get the "link" ("/tet?id=12&mv=13&san=221") from the HTML and the string "221" from either the script or the HTML, using Selenium, mechanize or requests (or some other library).

I made an unsuccessful attempt with mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.

Extra info, which might be important: to get to the page I have to click on a button with this code:

<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">

    load2

</a>

After the click, a "new page" loads in a part of the window (but the URL never changes).

stuXnet · Accepted Answer

I think you pasted the wrong script of yours ;)

I'm not sure what you need exactly - there are at least two different approaches.

Matching all hrefs using regex
Matching specific tags and using get_attribute(...)

For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and then apply something like the following regex (inside a string literal you will have to escape either the single or the double quotes!):

<a.+?href=['"](.*?)['"].*?/?>
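To make the regex approach concrete, here is a minimal sketch that runs the pattern above against the two snippets from the question. The names HREF_RE, unescape_js, html_sample and script_sample are made up for this illustration, and the unescaping step is an assumption about how the script text is escaped, based only on the fragment shown in the question.

```python
import re

# The pattern from the answer, written as a raw triple-quoted string so that
# neither quote character needs escaping.
HREF_RE = re.compile(r"""<a.+?href=['"](.*?)['"].*?/?>""")


def unescape_js(fragment):
    # Hypothetical helper: undo the backslash escapes placed before double
    # quotes and slashes in the script snippet from the question.
    return fragment.replace('\\"', '"').replace('\\/', '/')


# Sample strings copied from the question; in practice the input would be
# the full page source, e.g. webdriver.page_source.
html_sample = '<a class="visit" href="/tet?id=12&mv=13&san=221">\n\n    221\n\n</a>'
script_sample = r'<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>'

html_hrefs = HREF_RE.findall(html_sample)
script_hrefs = HREF_RE.findall(unescape_js(script_sample))

print(html_hrefs)
print(script_hrefs)
```

Without the unescaping step the pattern finds nothing in the script fragment, because href= is followed by a backslash there rather than a quote. Keep in mind that a regex like this is only a rough tool for HTML and can over-match on messier pages.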

For the second one, if you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of web elements and iterate through them to get their attributes.

This could result in code like this:

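hrefs = []
# webdriver is assumed to be a running Selenium driver instance;
# Selenium 4 spells this call find_elements(By.CSS_SELECTOR, '.visit')
elements = webdriver.find_elements_by_css_selector('.visit')

for element in elements:
    # the Python bindings use get_attribute, not getAttribute
    hrefs.append(element.get_attribute('href'))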

Or as a one-liner using a list comprehension:

hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]
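If you only have the page source as a string (for example from webdriver.page_source once the dynamic part has loaded), the same "match specific tags, then read the attribute" idea can be sketched without a browser. Note that this swaps in Python's standard html.parser module in place of the Selenium element lookup above; VisitLinkParser, extract_visit_links and sample are hypothetical names for this illustration, and the sample markup is taken from the question.

```python
from html.parser import HTMLParser


class VisitLinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> whose class list contains 'visit'."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._inside = False  # True while inside a matching <a>
        self._href = None     # href of the currently open matching <a>
        self._chunks = []     # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "visit" in classes:
            self._inside = True
            self._href = attr_map.get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Join the text pieces and drop the surrounding whitespace.
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._inside = False


def extract_visit_links(page_source):
    """Return a list of (href, link text) tuples for all a.visit elements."""
    parser = VisitLinkParser()
    parser.feed(page_source)
    parser.close()
    return parser.links


# Markup from the question, wrapped in a table cell as in the script snippet.
sample = '<td><a class="visit" href="/tet?id=12&mv=13&san=221">\n\n    221\n\n</a></td>'
links = extract_visit_links(sample)
print(links)
```

This gives you both the href and the "221" text in one pass. It only helps if the string you feed it actually contains the link, which for this page presumably means reading the source after the button click has loaded the new content.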

Tags:

python

python-requests

selenium

web-scraping

mechanize

user3053161
