I have a problem accessing Project Gutenberg. I am using Python 2.7.3. I can import NLTK and work in Python, but when I try to download the raw text of a book, the request is refused.
The text I was trying to access is Crime and Punishment. Its len(raw) should equal 1176831, but instead I get a len(raw) of 288. Here is the code that I used:
>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'
>>>
On a computer, go to Project Gutenberg and search for the book you want. Click on the book's title to see the list of downloadable file types. Click on the Kindle version (there may be one version with pictures and one without; you can choose either). Then choose to save the file.
If your device is Internet-enabled, just visit the catalog landing page for a book, and download one of the formats your device can display. Here is a sample catalog landing page: www.gutenberg.org/ebooks/11. Use the author/title search boxes on every page at www.gutenberg.org to find eBooks you are interested in.
Most books in the Project Gutenberg collection are distributed as public domain under United States copyright law. There are also a few copyrighted texts, such as those of science fiction author Cory Doctorow, that Project Gutenberg distributes with permission.
MLA Style recommends citing a Project Gutenberg book as a page from a website: Author last name, Author first name. “Title of Book.” Project Gutenberg, Publication/Updated date, URL.
The reason for the HTTP 403 response can be found here. Basically the site is "for human (non-automated) users only. Any perceived use of automated tools to access our web site will result in a temporary or permanent block of your IP address or subnet."
Your code "should work", but the website is determining you are accessing the site through code and not a browser. That is all I will say. :)
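To see the failure early instead of discovering it through len(raw), you could check the body before handing it to NLTK. The following is only a rough sketch: `looks_like_error_page` is a hypothetical helper name (it is not part of urllib or NLTK), and the heuristic assumes a real book arrives as plain text while an error reply starts with an HTML doctype or `<html>` tag, as the 403 body in the question does.

```python
def looks_like_error_page(raw):
    """Heuristic check: True if raw looks like an HTML page, not plain text.

    Hypothetical helper for illustration; not part of urllib or NLTK.
    """
    # Accept both byte strings (Python 2 str, Python 3 bytes) and text.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", "replace")
    # Only the start of the body matters for this check.
    head = raw[:200].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

With the session from the question, `looks_like_error_page(raw)` would be true for the 288-character reply. As far as I know, `urlopen(url).getcode()` in Python 2's urllib also reports the HTTP status, so comparing it against 200 is another way to catch this. Neither check gets around the block; they only make the failure visible.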
from urllib import urlopen
# Note the "-0" suffix in the file name.
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()