Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Python say this Netscape cookie file isn't valid?

Tags:

python

cookies

I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt file:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

and this is the code I'm using to grab the HTML:

import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler,cookie_handler)
    response = opener.open(request)
    return response

start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
                                 'q': '"' + search + '"',
                                 'btnG' : "",
                                 'hl' : 'en',
                                 'num': results_per_fetch,
                                 'as_sdt' : '1,14'})

url = base_url + "?" + params
resp = get_page(host + url, headers, params)

The full traceback is:

Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file

I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.

like image 695
Ricardo Altamirano Avatar asked Jul 17 '12 19:07

Ricardo Altamirano


People also ask

What is Netscape HTTP cookie file?

The Netscape cookie file format stores one cookie per physical line in the file with a bunch of associated meta data, each field separated with TAB. That file is called the cookiejar in curl terminology. The cookie file format is text based and stores one cookie per line.

How do you add cookies in Python?

To add cookies to a request for authentication, use the header object that is passed to the get/sendRequest functions. Only the cookie name and value should be set this way. The other pieces of the cookie (domain, path, and so on) are set automatically based on the URL the request is made against.

What is cookie jar in Python?

cookiejar module defines classes for automatic handling of HTTP cookies. It is useful for accessing web sites that require small pieces of data – cookies – to be set on the client machine by an HTTP response from a web server, and then returned to the server in later HTTP requests.

How do I get all cookies in Python?

How do I get cookie data in python? Use the make_response() function to get the response object from the return value of the view function. After that, the cookie is stored using the set_cookie() function of the response object. It is easy to read back cookies.


1 Answers

I see nothing in your example code or copy of the cookies.txt file that is obviously wrong.

I've checked the source code for the MozillaCookieJar._really_load method, which throws the exception that you see.

The first thing this method does, is read the first line of the file you specified (using f.readline()) and use re.search to look for the regular expression pattern "#( Netscape)? HTTP Cookie File". This is what fails for your file.

It certainly looks like your cookies.txt would match that format, so the error you see is quite surprising.

Note that your file is opened with a simple open(filename) call earlier on, so it'll be opened in text mode with universal line ending support, meaning it doesn't matter that you are running this on Windows. The code will see \n newline terminated strings, regardless of what newline convention was used in the file itself.

What I'd do in this case is triple-check that your file's first line is really correct. It needs to either contain "# HTTP Cookie File" or "# Netscape HTTP Cookie File" (spaces only, no tabs, between the words, capitalisation matching). Test this with the python prompt:

>>> f = open('cookies.txt')
>>> line = f.readline()
>>> line
'# Netscape HTTP Cookie File\n'
>>> import re
>>> re.search("#( Netscape)? HTTP Cookie File", line)
<_sre.SRE_Match object at 0x10fecfdc8>

Python echoed the line representation back to me when I typed line at the prompt, including the \n newline character. Any surprises like tab characters or unicode zero-width spaces will show up there as escape codes. I also verified that the regular expression used by the cookiejar code matches.

You can also use the pdb python debugger to verify what the http.cookiejar module really does:

>>> import pdb
>>> import http.cookiejar
>>> jar = http.cookiejar.MozillaCookieJar('cookies.txt')
>>> pdb.run('jar.load()')
> <string>(1)<module>()
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1759)load()
-> def load(self, filename=None, ignore_discard=False, ignore_expires=False):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1761)load()
-> if filename is None:
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1762)load()
-> if self.filename is not None: filename = self.filename
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1765)load()
-> f = open(filename)
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1766)load()
-> try:
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1767)load()
-> self._really_load(f, filename, ignore_discard, ignore_expires)
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1989)_really_load()
-> def _really_load(self, f, filename, ignore_discard, ignore_expires):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1990)_really_load()
-> now = time.time()
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1992)_really_load()
-> magic = f.readline()
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1993)_really_load()
-> if not self.magic_re.search(magic):
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1999)_really_load()
-> try:

In the above sample pdb session I used a combination of the step and next commands to verify that the regular expression test (self.magic_re.search(magic)) actually passed.

like image 51
Martijn Pieters Avatar answered Nov 02 '22 09:11

Martijn Pieters