How to find backlinks in a website with python [closed]

Question

I am stuck on this: I want to find the backlinks of a website, but I cannot figure out how to do it. Here is my code so far:

readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))

To find backlinks, I want the links that start with http but do not include google, and I cannot figure out how to do this.

7stud · Accepted Answer

from BeautifulSoup import BeautifulSoup
import re

html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http:/amazon.com">Link 2</a>
"""

def processor(tag):
    # Keep only tags whose href does not mention google.
    href = tag.get('href')
    if not href:
        return False
    return 'google' not in href

soup = BeautifulSoup(html)
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links

--output:--
[<a href="http:/amazon.com">Link 2</a>]

However, it may be more efficient to get all the links starting with http first, then keep only those that do not have 'google' in their hrefs:

http_links = soup.findAll('a', href=re.compile(r"^http"))
results = [a for a in http_links if 'google' not in a['href']]
print results

--output:--
[<a href="http:/amazon.com">Link 2</a>]
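Note that the substring test also drops any link that merely mentions google in its path or query string. A minimal sketch of a stricter check follows, assuming Python 3 and only the standard library rather than the BeautifulSoup version used above. The names `LinkCollector` and `external_links` and the sample HTML are hypothetical, not from the answer. It compares the hostname instead of searching the whole href:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain):
    """Return http(s) links whose host is neither own_domain nor a subdomain of it."""
    parser = LinkCollector()
    parser.feed(html)
    results = []
    for href in parser.hrefs:
        parts = urlparse(href)
        if parts.scheme not in ("http", "https"):
            continue  # relative links, mailto:, etc.
        host = (parts.hostname or "").lower()
        if not host:
            continue  # malformed URL without a host
        if host == own_domain or host.endswith("." + own_domain):
            continue  # link points back to the site itself
        results.append(href)
    return results


html = """
<a href="/index.html">relative</a>
<a href="http://google.com">own site</a>
<a href="http://www.google.com/maps">own subdomain</a>
<a href="http://amazon.com/google-review">external</a>
"""
print(external_links(html, "google.com"))
# prints ['http://amazon.com/google-review']
```

The last link is kept even though its path contains "google", because only the host is compared. Whether that is the behaviour you want depends on what you count as a backlink.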

vegi · Answer

Here is a regexp that matches links starting with http:// unless they include google:

re.compile(r"(?!.*google)^http://(www\.)?.*")
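To see what this pattern accepts, here is a small check written for Python 3, using a raw-string form of the pattern with the dot escaped. The sample hrefs are made up for illustration. The negative lookahead `(?!.*google)` rejects any string that contains google anywhere, and `^http://` means https links are not matched:

```python
import re

# Negative lookahead first, then require the string to start with http://
pattern = re.compile(r"(?!.*google)^http://(www\.)?.*")

# Hypothetical sample hrefs
hrefs = [
    "/index.html",                   # relative: no match
    "http://google.com",             # contains google: no match
    "http://www.amazon.com",         # match
    "https://amazon.com",            # https, not http://: no match
    "http://example.com/?q=google",  # google in the query string: no match
]

matches = [h for h in hrefs if pattern.search(h)]
print(matches)
# prints ['http://www.amazon.com']
```

If https links should count as well, changing `http://` to `https?://` in the pattern would be one way to allow them.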

Tags:

python

regex

beautifulsoup

user2682790
