
Can't access Project Gutenberg raw text

I am having trouble accessing the Project Gutenberg library. I am using Python 2.7.3. I can import NLTK and work with it in Python, but when I try to fetch a book's raw text, the request is refused.

The text I was accessing is Crime and Punishment. Its len(raw) should equal 1176831, but I get a len(raw) of 288 instead. Here is the code I used:

>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'
>>> 
user1799092 asked Nov 05 '12 03:11




2 Answers

The HTTP 403 response comes from Project Gutenberg's policy on automated access. Basically, the site is "for human (non-automated) users only. Any perceived use of automated tools to access our web site will result in a temporary or permanent block of your IP address or subnet."

Your code "should work", but the website detects that you are accessing it through code rather than a browser. That is all I will say. :)
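As a side note on diagnosing this, the len(raw) of 288 in the question is the length of the 403 error page, not of the book. Python 2's urllib.urlopen returns that page without raising, so the failure is easy to miss. Below is a minimal sketch of a check that fails loudly instead. The helper names looks_like_error_page and fetch_text are hypothetical, and the HTML test is only a heuristic; it assumes urllib2 on Python 2 (or urllib.request on Python 3), both of which raise HTTPError on a 403.

```python
try:
    from urllib.request import urlopen      # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError  # Python 2


def looks_like_error_page(raw):
    """Heuristic (an assumption): an HTML body means we did not get plain text."""
    head = raw[:200].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def fetch_text(url):
    """Return the response body, or raise IOError if the server refused."""
    try:
        response = urlopen(url)
    except HTTPError as err:
        # urllib2 / urllib.request raise on 403 instead of returning the page.
        raise IOError("HTTP %d for %s" % (err.code, url))
    raw = response.read()
    if looks_like_error_page(raw):
        raise IOError("got an HTML page instead of plain text from %s" % url)
    return raw
```

Calling raw = fetch_text(url) in place of urlopen(url).read() would surface the refusal as an exception. It does not change how the site treats automated requests.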

Ray Toal answered Sep 30 '22 05:09



from urllib import urlopen

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

raw = urlopen(url).read()
Valentin Vrzheshch answered Sep 30 '22 04:09
