
Find nearest link with BeautifulSoup (Python)

I am doing a small project where I extract occurrences of political leaders in newspapers. Sometimes a politician will be mentioned, and there is neither a parent nor a child element with a link (due, I guess, to semantically bad markup).

So I want to create a function that can find the nearest link and extract it. In the case below the search string is Rasmussen and the link I want is /307046.

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find)) 

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url+link
                return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element = element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href="/307046">
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element,url)

# run it
if contexts:
    for a in contexts:
        find_nearest( element=a, url="http://information.dk")

The direct call below works:

print contexts[0].parent.parent.parent.a['href'].encode('utf-8')

For reference the whole sorry code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(P.S. Using BeautifulSoup 4.)


EDIT: SimonSapin asks me to define nearest: by nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the href produced by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.

I suspect a 'fewest characters away' measure would often work too. In that case a solution could be hacked together with find and rfind - but I would really like to do this via BS. Since contexts[0].parent.parent.parent.a['href'].encode('utf-8') works, it must be possible to generalise that into a script.
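Roughly, the find/rfind hack I have in mind would look something like this (just a sketch to illustrate the idea; nearest_href_by_chars is a made-up name):

def nearest_href_by_chars(html, term):
    """Return the href value whose position in the raw HTML string is
    closest, in characters, to the first occurrence of term."""
    pos = html.find(term)
    if pos == -1:
        return None
    before = html.rfind('href="', 0, pos)  # last href before the term
    after = html.find('href="', pos)       # first href after the term
    if before == -1 and after == -1:
        return None
    # pick whichever href is the fewest characters away
    if after == -1 or (before != -1 and pos - before <= after - pos):
        start = before + len('href="')
    else:
        start = after + len('href="')
    return html[start:html.index('"', start)]

# nearest_href_by_chars(tekst, "Rasmussen") gives /307046 for the snippet above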

EDIT: Maybe I should emphasize that I'm looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search as suggested by @erik85 would quickly become messy, I think.

asked Aug 02 '12 by Andreas



2 Answers

Someone will probably come up with a solution that works with copy & paste, and you will think that it solves your problem. Your problem isn't the code, though; it's your strategy. There is a software design principle called "divide and conquer" that you should apply when redesigning your code: separate the code that parses your HTML strings into trees/graphs from the code that searches for the nearest node (probably a breadth-first search). Not only will you learn to design better software, your problem will probably just cease to exist.

I think you are smart enough to solve this yourself, yet I also want to provide a skeleton:

def parse_html(txt):
    """ reads a string of html and returns a dict/list/tuple presentation"""
    pass

def breadth_first_search(graph, start, end):
    """ finds the shortest way from start to end
    You can probably customize start and end to work well with the input you want
    to provide. For implementation details see the link in the text above.
    """
    pass

def find_nearest_link(html, name):
    """putting it all together"""
    return breadth_first_search(parse_html(html), name, "link")
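For example, the breadth-first part could be sketched like this if you keep BeautifulSoup as the parser and treat each node's parent and children as the graph edges (just an illustration of the idea, with made-up names, not a finished implementation):

from collections import deque
import re

from bs4 import BeautifulSoup, Tag

def nearest_href(soup, pattern):
    """Breadth-first search outwards from each text node matching pattern,
    following parent and child edges, and return the href of the first
    <a> tag reached, i.e. the one the fewest tree steps away."""
    for start in soup.find_all(text=re.compile(pattern)):
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if id(node) in seen:
                continue
            seen.add(id(node))
            if isinstance(node, Tag) and node.name == 'a' and node.get('href'):
                return node['href']
            if node.parent is not None:
                queue.append(node.parent)      # edge to the parent
            if isinstance(node, Tag):
                queue.extend(node.children)    # edges to the children
    return None

# with the tekst and to_find from the question:
# print(nearest_href(BeautifulSoup(tekst, 'html.parser'), to_find))  # -> /307046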

PS: Doing this also applies another principle, this time from mathematics: assuming there is a problem you don't know a solution to (finding links close to a chosen substring) and there is a group of problems you do know a solution to (graph traversal), try to transform your problem so it matches that group. Then you can just use basic solution patterns (which are probably already implemented in a language/framework of your choice) and you are done.

answered by erikbstack


Here is a solution using lxml. The main idea is to find all the preceding and following elements and then do a roundrobin iteration through those elements:

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

A similar solution using BeautifulSoup's (bs4's) next_elements and previous_elements should also be possible.
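A rough bs4 sketch of that (illustrative only; names are mine, and it keeps roughly the same preference order as the lxml version: the nearest preceding href first, then the nearest following one):

import re

from bs4 import BeautifulSoup, Tag

def find_nearest_bs4(text_node):
    """Return the href nearest to text_node: look backwards through
    previous_elements (nearest first, which also covers the parents),
    then forwards through next_elements, and return the first href found."""
    def first_href(elements):
        for el in elements:
            if isinstance(el, Tag) and el.get('href'):
                return el['href']
        return None

    return (first_href(text_node.previous_elements)
            or first_href(text_node.next_elements))

# with the tekst and to_find from the question:
# soup = BeautifulSoup(tekst, 'html.parser')
# for node in soup.find_all(text=re.compile(to_find)):
#     print(find_nearest_bs4(node))   # -> /307046

The complete lxml script: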


import lxml.html as LH
import itertools

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # http://docs.python.org/library/itertools.html#recipes
    # Author: George Sakkis
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for n in nexts:
                yield n()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
doc = LH.fromstring(tekst)

for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s = to_find)):
    print(find_nearest(x))

yields

/307046
answered by unutbu