Can't access Project Gutenberg raw text

I have a problem accessing the Project Gutenberg library. I am using Python 2.7.3. I can import the NLTK library and work with it in Python, but when I try to download a raw text, the site doesn't allow me to.

The text I am trying to access is Crime and Punishment. Its len(raw) should equal 1176831, but instead I get a len(raw) of 288. Here is the code that I used:

>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'
>>>

asked Nov 05 '12 03:11

user1799092

2 Answers

The reason for the HTTP 403 response can be found here. Basically the site is "for human (non-automated) users only. Any perceived use of automated tools to access our web site will result in a temporary or permanent block of your IP address or subnet."

Your code "should work", but the website has detected that you are accessing it through code rather than a browser. That is all I will say. :)
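To make this kind of failure visible right away, the response can be checked before the body is treated as book text. The following is only a minimal sketch, not something from the original answer: `validate_response`, `fetch_text`, and the `min_length` threshold are hypothetical names and values chosen for illustration. It does not get around the block; it only reports it clearly instead of silently returning a 288-byte error page.

```python
try:
    from urllib.request import urlopen  # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError  # Python 2.7


def validate_response(status, body, min_length=1000):
    """Return body if it looks like a real download, else raise IOError.

    min_length is an arbitrary illustrative threshold (an assumption),
    not a rule published by Project Gutenberg.
    """
    if status != 200:
        raise IOError("HTTP %d: the server refused the request" % status)
    if len(body) < min_length:
        raise IOError("only %d bytes received; probably an error page" % len(body))
    return body


def fetch_text(url):
    """Download url and validate the response before returning the bytes."""
    try:
        response = urlopen(url)
    except HTTPError as err:
        # urllib2 and urllib.request raise on 4xx/5xx responses
        # instead of returning the error page as the body.
        raise IOError("HTTP %d: the server refused the request" % err.code)
    return validate_response(response.getcode(), response.read())
```

With a check like this, the 403 page from the question would raise an error mentioning the status code rather than ending up in `raw`.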

answered Sep 30 '22 05:09

Ray Toal

from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()
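Once the download succeeds, the bytes still need decoding before they are handed to NLTK. A small sketch, assuming the file at this URL is UTF-8 encoded and may start with a byte-order mark (the thread does not state this); `decode_book` is a hypothetical helper name:

```python
def decode_book(raw_bytes):
    """Decode downloaded bytes to text, assuming UTF-8 with an optional BOM."""
    # "utf-8-sig" behaves like "utf-8" but drops a leading byte-order mark
    # if one is present, so the text does not start with a stray character.
    return raw_bytes.decode("utf-8-sig")


# Usage with a small stand-in for the downloaded bytes:
sample = b"\xef\xbb\xbfCRIME AND PUNISHMENT"
text = decode_book(sample)
print(text)
```

If the file turns out to use a different encoding, the `decode` call will raise `UnicodeDecodeError`, which is a useful signal in itself.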

answered Sep 30 '22 04:09

Valentin Vrzheshch

Tags: python, urllib, python-2.7
