
label empty or too long - python urllib2

Tags:

python

urllib2

I am running into a strange situation. I am checking URLs like this:

import httplib2

def check_urlstatus(url):
    h = httplib2.Http()
    try:
        # HEAD request: only the status line and headers are needed
        resp = h.request("http://" + url, 'HEAD')
        if int(resp[0]['status']) < 400:
            return 'ok'
        else:
            return 'bad'
    except httplib2.ServerNotFoundError:
        return 'bad'

If I test this with:

if check_urlstatus('.f.de') == "bad": #<--- error happening here
   #..
   #..

it raises:

UnicodeError: label empty or too long

What am I doing wrong here?

EDIT: here is the traceback with idna. I guess it tries to split the input on '.', and in this case the first label is empty: it is the (empty) space before the first '.'.

(screenshot of the traceback)

doniyor asked Aug 03 '14 08:08


1 Answer

The problem is your URL cannot properly be encoded as per the IDNA rules, which govern how internationalized domain names are converted:

The conversions between ASCII and non-ASCII forms of a domain name are accomplished by algorithms called ToASCII and ToUnicode. These algorithms are not applied to the domain name as a whole, but rather to individual labels. For example, if the domain name is www.example.com, then the labels are www, example, and com. ToASCII or ToUnicode are applied to each of these three separately.

The details of these two algorithms are complex, and are specified in RFC 3490. The following gives an overview of their function.

ToASCII leaves unchanged any ASCII label, but will fail if the label is unsuitable for the Domain Name System. If given a label containing at least one non-ASCII character, ToASCII will apply the Nameprep algorithm, which converts the label to lowercase and performs other normalization, and will then translate the result to ASCII using Punycode before prepending the four-character string "xn--". This four-character string is called the ASCII Compatible Encoding (ACE) prefix, and is used to distinguish Punycode encoded labels from ordinary ASCII labels. The ToASCII algorithm can fail in several ways; for example, the final string could exceed the 63-character limit of a DNS label. A label for which ToASCII fails cannot be used in an internationalized domain name.
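To make the per-label behaviour concrete, here is a minimal sketch using Python 3's built-in "idna" codec (the question uses Python 2, where the same codec exists). The helper name `to_ascii` is made up for illustration and is not part of any library.

```python
# Sketch only: wraps the standard "idna" codec, which applies ToASCII
# to each dot-separated label of the hostname.

def to_ascii(hostname):
    """Return the ASCII form of hostname, or None if a label is rejected."""
    try:
        return hostname.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty labels and for labels longer than 63 characters.
        return None

print(to_ascii("www.example.com"))   # ASCII labels pass through unchanged
print(to_ascii("bücher.example"))    # non-ASCII label gets the "xn--" prefix
print(to_ascii("a" * 64 + ".com"))   # first label is too long -> None
print(to_ascii(".f.de"))             # first label is empty -> None
```

The last call is the situation from the question: splitting '.f.de' on '.' yields an empty first label.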

In your case, the leading dot in '.f.de' produces an empty first label (''), and an empty label is not valid in a domain name, so you end up with this:

>>> '.f.de'.encode('idna')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/lib/python2.6/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

If you change the domain name to 'a.f.de' it should not raise this exception.
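If the hostnames come from untrusted input, one option is to reject the ones that cannot be IDNA-encoded before making any request. The following is only a sketch under some assumptions: it is written for Python 3, it swaps httplib2 for the standard library's `http.client` so that it has no third-party dependency, and the helper name `is_encodable_host` and the `timeout` parameter are made up for illustration.

```python
import http.client

def is_encodable_host(host):
    """Return True if every label of host survives IDNA encoding."""
    try:
        host.encode("idna")
    except UnicodeError:
        return False
    return True

def check_urlstatus(host, timeout=5):
    # Reject hosts such as '.f.de' up front instead of letting the
    # UnicodeError escape from the networking code.
    if not is_encodable_host(host):
        return "bad"
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
    except http.client.HTTPException:
        return "bad"
    try:
        conn.request("HEAD", "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts, malformed replies
        return "bad"
    finally:
        conn.close()
    return "ok" if status < 400 else "bad"
```

With this guard, `check_urlstatus('.f.de')` returns 'bad' without touching the network. If you prefer to keep httplib2, adding `UnicodeError` to the `except` clause of the original function should have a similar effect.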

Burhan Khalid answered Nov 10 '22 13:11