
Elegant way to try/except a series of BeautifulSoup commands?

I'm parsing webpages on a site displaying item data. These items have about 20 fields which may or may not occur -- say: price, quantity, last purchased, high, low, etc.

I'm currently using a series of commands; about 20 lines of soup.find('div',{'class':SOME_FIELD_OF_INTEREST}) to look for each separate field of interest. (Some are in div, span, dd, and so on, so it's difficult to just do a soup.find_all('div') command.)

My question: Is there an elegant way to try and except everything such that the viewing of said code can be more compact or concise? Right now a sample line would look like:

try:
    soup.find('div', {'id':'item-pic'}).img["src"]
except:
    ""

I was hoping to combine everything in one line. I don't think I can syntactically run try: <line of code> except: <code>, and I'm not sure how I'd write a function that goes try_command(soup.find('div',{'id':'item-pic'}).img["src"]) without actually running the command.

I'd love to hear if anybody has any advice (including: "this isn't possible/practical, move on"). :)

EDIT: After talking a bit, I guess I wanted to see what is good practice for inline exception handling, and if that's the right route to take.

binarysolo asked Dec 09 '12 01:12



1 Answer

Maybe something like this:

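def try_these(start_obj, *args):
    """Apply each step in turn; return None as soon as one fails."""
    obj = start_obj
    for trythat in args:
        if obj is None:
            return None
        try:
            if isinstance(trythat, str):
                # plain string: attribute lookup on the current object
                obj = getattr(obj, trythat)
            else:
                # (method name, tuple of arguments): method call
                method, opts = trythat
                obj = getattr(obj, method)(*opts)
        except Exception:
            return None
    return obj

# same lookup as in the question: soup.find('div', {'id':'item-pic'}).img["src"]
src = try_these(soup, ('find', ('div', {'id': 'item-pic'})),
                'img',
                ('get', ('src',)))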

Each step is either a str (an attribute to get from the current object) or a tuple of (method name, tuple of arguments); in the end you get either None or the result. I'm not familiar with BeautifulSoup, so I'm not sure whether get('src') is a good approach (the tag is probably not a dict); anyway, you can easily modify the snippet to accept something more than only 'call or attr'.
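As a rough illustration of that last point, the sketch below lets a step also be any callable that receives the current object, which covers subscripting such as ["src"]. The name try_steps and the stand-in page object are made up for this example; with BeautifulSoup you would pass soup as the first argument instead.

```python
from types import SimpleNamespace


def try_steps(start_obj, *steps):
    """Hypothetical variant of try_these: a step may also be a callable."""
    obj = start_obj
    for step in steps:
        if obj is None:
            return None
        try:
            if callable(step):
                # callable step: receives the current object
                obj = step(obj)
            elif isinstance(step, str):
                # string step: attribute lookup
                obj = getattr(obj, step)
            else:
                # (method name, tuple of arguments) step: method call
                method, opts = step
                obj = getattr(obj, method)(*opts)
        except Exception:
            return None
    return obj


# Stand-in for a parsed page: an object whose .img holds a dict of attributes.
page = SimpleNamespace(img={"src": "pic.png"})

found = try_steps(page, "img", lambda attrs: attrs["src"])
missing = try_steps(page, "img", lambda attrs: attrs["href"])  # KeyError is swallowed
absent = try_steps(page, "video", ("get", ("src",)))  # AttributeError is swallowed
```

Catching Exception for every step keeps the helper short, but it also hides genuine bugs inside a step, so a narrower tuple of exception types may be the safer choice in real scraping code.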


Inspired by your question, I wrote a simple Python module that helps deal with this situation; you can find it here

import silentcrawler

wrapped = silentcrawler.wrap(soup)
# just returns None on failure
print(wrapped.find('div', {'id':'item-pic'}).img["src"].value_)

# or
def on_success(value):
    print('found value: %s' % value)

wrapped = silentcrawler.wrap(soup, success=on_success)
# calls on_success if everything goes OK
wrapped.find('div', {'id':'item-pic'}).img["src"].value_

There are more possibilities.
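One such possibility, closer to the try_command idea from the question, is to defer the lookup by wrapping it in a lambda so that nothing runs until the helper calls it. This is only a sketch under assumptions: attempt is a hypothetical helper name, and the item dict stands in for a parsed page so the example runs without BeautifulSoup.

```python
def attempt(thunk, default=None):
    """Call thunk(); return default if the lookup chain breaks.

    `attempt` is a hypothetical helper, not part of BeautifulSoup.
    """
    try:
        return thunk()
    except (AttributeError, KeyError, IndexError, TypeError):
        return default


# Stand-in data; with BeautifulSoup the lambda body could be
# soup.find('div', {'id': 'item-pic'}).img["src"]
item = {"pic": {"src": "pic.png"}, "price": None}

src = attempt(lambda: item["pic"]["src"], default="")
low = attempt(lambda: item["low"]["value"], default="")  # KeyError: no such field
price = attempt(lambda: item["price"]["value"], default="")  # TypeError: None is not subscriptable
```

Each field then takes one line, and listing the expected exception types instead of using a bare except means unrelated errors still surface.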

lupatus answered Sep 21 '22 23:09