How to ignore links that are located in parentheses?

I have a task to show that most Wikipedia pages lead to the "Philosophy" page if you keep clicking the first link.

I wrote code that finds the first link using XPath, but I'm supposed to ignore links placed inside parentheses.

For example, in the following text (the links, shown in bold in the original, include "Greek" and "activity"): Semiosis (from the Greek: σημείωσις, sēmeíōsis, a derivation of the verb σημειῶ, sēmeiô, "to mark") is any form of activity...

The first link in this div is "Greek", but following it gets me into a loop, so I want to filter it out and click the first link after the parentheses, in this case "activity".

Is there a way to ignore the links in parentheses?

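from selenium import webdriver
from selenium.webdriver.common.by import By

start_page = "https://en.wikipedia.org/wiki/Special:Random"

# Assumption: the original post does not show how the driver was created
driver = webdriver.Chrome()

def click_link():
    # Clicks the first link in the article body, even if it is inside parentheses
    link = driver.find_element(By.XPATH, "//div[@class='mw-parser-output']/p/a")
    link.click()

driver.get(start_page)

redirects = 0

# The <title> element is not rendered, so its .text is empty; use driver.title.
# Assumption: Wikipedia page titles end with " - Wikipedia".
title = driver.title

while title != "Philosophy - Wikipedia":
    click_link()
    redirects += 1
    title = driver.title

print(redirects)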
Chromec asked Nov 06 '22 19:11

1 Answer

This is a lot more complicated than I initially thought it would be. Selenium lets you locate elements on the page, but it doesn't show you the textual context of an element, e.g. whether a link is inside ()s. If you treat everything as elements, you lose that context: you can't see what text surrounds a given element. If you treat everything as text (get .text from the parent), you can no longer tell what is a link. The only way I could think of to do this is to:

  1. Get the parent element that contains the first paragraph
  2. Use .get_attribute("innerHTML") to get the HTML contained in that element
  3. Search for a link that isn't inside ()s with a regex
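Step 3 can be tried out without a browser by running it on an HTML string. The sketch below is one possible way to do the search, not the only one: instead of a single regex it walks the HTML token by token and tracks parenthesis depth, so nested parentheses and parentheses inside an href are handled. The function name `first_link_outside_parens` and the sample markup are made up for illustration; the sample is a simplified version of the Semiosis paragraph with an invented nested parenthesis.

```python
import re

def first_link_outside_parens(html):
    """Return the first <a>...</a> string not enclosed in parentheses, or None."""
    depth = 0  # current parenthesis nesting level in the visible text
    # Split the HTML into tags ("<...>") and the runs of text between them.
    for token in re.finditer(r"<[^>]*>|[^<]+", html):
        piece = token.group()
        if piece.startswith("<"):
            # Parentheses inside a tag (e.g. in href) are never counted.
            if depth == 0 and re.match(r"<a\s", piece):
                end = html.find("</a>", token.end())
                if end != -1:
                    return html[token.start():end + len("</a>")]
        else:
            # Count parentheses only in visible text.
            for ch in piece:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
    return None

# Hypothetical, simplified markup modeled on the example in the question.
sample = (
    '<b>Semiosis</b> (from the <a href="/wiki/Greek_language">Greek</a>: '
    'sēmeíōsis (<a href="/wiki/Verb">verb</a>), "to mark") is any form of '
    '<a href="/wiki/Action_(philosophy)" title="Action (philosophy)">activity</a>'
)
print(first_link_outside_parens(sample))
```

With Selenium, the `html` argument would be the string returned by .get_attribute("innerHTML") in step 2.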

Once you find it, though, you have only the string of the A tag, not an actual element that you can click on. With that string, you can do one of two things:

  1. Get the text of the found link and find it on the page using a locator (so you can click on it). That doesn't guarantee that it's the right link, e.g. imagine multiple links to "Greece" on the page.

  2. Look at the href of the found A tag and reconstruct a URL that you can navigate to.

Here's some code to get you going in the right direction. You'll have to decide which path you want to take from here.

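import re
from selenium.webdriver.common.by import By
...
start_page = "https://en.wikipedia.org/wiki/Special:Random"
driver.get(start_page)
first_para = driver.find_element(By.CSS_SELECTOR, "#mw-content-text > div > p")
# Use the paragraph's HTML, not .text, so the <a> tags are still visible
html = first_para.get_attribute("innerHTML")
# Match either an <a> tag (captured) or a parenthesized span (not captured)
regex = r"(<a .*?</a>)|\(.*?\)"
# Parenthesized spans come back from findall as empty strings; drop them
links = [m for m in re.findall(regex, html) if m]
print(links[0])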

For the Semiosis example, this will print <a href="/wiki/Action_(philosophy)" title="Action (philosophy)">activity</a>, which is the first A tag that is not inside ()s. You can reconstruct the URL by parsing out the href attribute and appending it to the main URL like

new_url = "https://en.wikipedia.org" + href

or go a different direction. The choice is up to you and your requirements but this should be enough to get you started.
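If you go the URL route, parsing out the href could look roughly like the sketch below. The helper name `url_from_a_tag` is made up for illustration, and it assumes the href is written with double quotes, which is how the A tag appears in the output above. It uses `urljoin` rather than plain string concatenation so that an already absolute href is left as it is.

```python
import re
from html import unescape
from urllib.parse import urljoin

def url_from_a_tag(a_tag, base="https://en.wikipedia.org"):
    """Build an absolute URL from the href of an <a ...> tag string."""
    # Assumption: the attribute is double-quoted, as in the printed tag.
    match = re.search(r'href="([^"]*)"', a_tag)
    if match is None:
        return None
    # innerHTML escapes characters such as & as &amp;, so decode entities.
    href = unescape(match.group(1))
    return urljoin(base, href)

a_tag = '<a href="/wiki/Action_(philosophy)" title="Action (philosophy)">activity</a>'
print(url_from_a_tag(a_tag))
# prints https://en.wikipedia.org/wiki/Action_(philosophy)
```

The result can then be passed to driver.get() in place of clicking the link.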

JeffC answered Nov 15 '22 12:11