 

Python Beautiful Soup 'ascii' codec can't encode character u'\xa5'

I am encountering some weird characters while web scraping some elements of a page. The characters that seem to cause the error are:

? ????Á¢¢Á? /?? />? /??? ?/¢¥Á ??%% ?Á ?????Á? ?> /???¥??> ¥? ¥©Á ?>¢¥/%%/¥??> ? >Á? Â?Á ©???¢ ñ%Á?¥???/% Á%Á?¥??>?? />? Â??Á? ??¥?? ??¢¥????¥??> ¢`¢¥Á¢ ??%% ?Á ??À?/?Á? ¥? _ÁÁ¥ ?>??Á/¢?>À Á????Á>¥ ????¥Á? />? ??__?>??/¥??>¢ ?Á

The relevant code is below:

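import time
import requests
from bs4 import BeautifulSoup

flagcnt = 0  # request counter (assumed to be initialised earlier in the full script)
url = "http://www.nsf.gov#######@#@#@##"
# webbrowser.open(url, new=new)
flagcnt += 1
if flagcnt % 20 == 0:  # pause every 20 requests to avoid being shut out
    print("flagcount: %d" % flagcnt)
    time.sleep(5)
# Program code extraction
r = requests.get(url)
sp = BeautifulSoup(r.content)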

Page : http://www.nsf.gov/awardsearch

I've read all the pages on this error; some suggest decoding and encoding, but they don't seem to help. I don't know which encoding is being used here. I tried downgrading the BS version, but that didn't help. Any help is appreciated. Python 2.7, BS 4.

Pulkit Bhardwaj asked Apr 17 '15 00:04


1 Answer

This works for me:

page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
page_soupy = BeautifulSoup.BeautifulSoup(page_text)
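
To see what that round trip does, here is a minimal sketch on a made-up string. The `sample` value is hypothetical and only stands in for `r.text`; it is not taken from the NSF page. It first reproduces the error from the title, then applies the same encode/decode step:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text standing in for r.text (an assumption, not real page data).
sample = u"price: \xa5100 caf\xe9"

# Reproduce the error from the title: ASCII has no representation for u'\xa5'.
try:
    sample.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True

# Same step as above: encode to UTF-8 bytes, then decode as ASCII,
# silently dropping every byte that is not plain ASCII.
cleaned = sample.encode("utf-8").decode("ascii", "ignore")

print(raised)
print(cleaned)
```

Note that this is lossy: the non-ASCII characters are thrown away rather than repaired, so it probably only suits cases where those characters are not needed.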
nivix zixer answered Nov 15 '22 03:11