I'm parsing webpages on a site displaying item data. These items have about 20 fields which may or may not occur -- say: price, quantity, last purchased, high, low, etc.
I'm currently using a series of commands: about 20 lines of soup.find('div',{'class':SOME_FIELD_OF_INTEREST}) to look for each separate field of interest. (Some are in div, span, dd, and so on, so it's difficult to just do a soup.find_all('div') command.)
My question: Is there an elegant way to try and except everything so that the code is more compact or concise? Right now a sample line would look like:
try:
    soup.find('div', {'id':'item-pic'}).img["src"]
except:
    ""
I was hoping to combine everything in one line. I don't think I can syntactically run try: <line of code> except: <code>, and I'm not sure how I'd write a function that goes try_command(soup.find('div',{'id':'item-pic'}).img["src"]) without actually running the command.
I'd love to hear if anybody has any advice (including: "this isn't possible/practical, move on"). :)
EDIT: After talking a bit, I guess I wanted to see what is good practice for inline exception handling, and if that's the right route to take.
Answer #1: You can use extract() to remove an unwanted tag before you get the text. It keeps all the '\n' characters and spaces, though, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the outer span and keep only the NavigableString objects (the plain text in the HTML).
find returns the first element that matches the search, or None if nothing on the page matches. find_all scans the entire document and returns every match.
Beautiful Soup's find_all() method returns a list of all the tags or strings that match the given criteria.
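To make the difference concrete, here is a small sketch. The HTML snippet and the names in it are made up for illustration, and it assumes bs4 is installed. It also shows why the chained lookup in the question blows up: find hands back None for a missing field, and None has no .img.

```python
from bs4 import BeautifulSoup

# Made-up markup, only for illustration
html = (
    '<div id="item-pic"><img src="/a.png"></div>'
    '<span class="price">5</span><span class="price">7</span>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("span", {"class": "price"})           # first match only
all_prices = soup.find_all("span", {"class": "price"})  # every match, as a list
missing = soup.find("dd", {"class": "low"})             # no match: None

print(first.get_text())   # text of the first matching span
print(len(all_prices))    # number of matching spans
print(missing)            # None, so missing.img would raise AttributeError
```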
Maybe something like this:
def try_these(start_obj, *args):
    # Walk a chain of attribute lookups / method calls; return None on any failure.
    obj = start_obj
    for trythat in args:
        if obj is None:
            return None
        try:
            if isinstance(trythat, str):
                # plain string: attribute access, e.g. 'img'
                obj = getattr(obj, trythat)
            else:
                # (method name, args tuple): method call
                method, opts = trythat
                obj = getattr(obj, method)(*opts)
        except Exception:
            return None
    return obj

src = try_these(soup, ('find', ('div', {'id': 'item-pic'})),
                'img',
                ('get', ('src',)))
where you can pass a str to get an attribute from the object, or a tuple (str method, tuple params) to call a method. In the end you'll get either None or the result. I'm not familiar with soup so I'm not sure if get('src') would be a good approach (as it's probably not a dict); anyway, you can easily modify that snippet to accept something more than only 'call or attr'.
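Another route, which is my own sketch and not something taken from the snippet above, is to defer the lookup by wrapping it in a lambda, so nothing runs until the helper calls it inside a try block. The names attempt, item, and FIELDS are hypothetical. The nested dict stands in for the parsed page so the example runs without BeautifulSoup; with a real soup object the lambda body would be the chained soup.find(...).img["src"] expression from the question.

```python
def attempt(func, *args, default=""):
    """Call func(*args); return default if the lookup chain breaks."""
    try:
        return func(*args)
    except (AttributeError, KeyError, TypeError, IndexError):
        return default

# Stand-in for a parsed page (assumption: real code would use a soup object)
item = {"pic": {"img": {"src": "/images/item.png"}}, "price": None}

# One line per field; the lambda delays evaluation until attempt() runs it
src = attempt(lambda: item["pic"]["img"]["src"])
low = attempt(lambda: item["price"]["low"])  # None["low"] raises TypeError

# With ~20 fields, a table of getters keeps the whole thing compact
FIELDS = {
    "picture": lambda d: d["pic"]["img"]["src"],
    "low": lambda d: d["price"]["low"],
}
record = {name: attempt(getter, item) for name, getter in FIELDS.items()}
print(record)
```

Catching a short list of exception types instead of using a bare except keeps real bugs (a typo in a name, say) from being silently swallowed.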
Inspired by your question, I wrote a simple Python module that helps deal with such situations; you can find it here:
import silentcrawler
wrapped = silentcrawler.wrap(soup)
# just returns None on failure
print(wrapped.find('div', {'id':'item-pic'}).img["src"].value_)
# or
def on_success(value):
    print('found value:', value)
wrapped = silentcrawler.wrap(soup, success=on_success)
# calls on_success if every step of the chain succeeds
wrapped.find('div', {'id':'item-pic'}).img["src"].value_
There are more possibilities.