
Passing web data into Beautiful Soup - Empty list

I've rechecked my code and compared it with similar examples of opening a URL and passing the web data into Beautiful Soup. For some reason my code doesn't return anything, although it appears to be in the correct form:

>>> from bs4 import BeautifulSoup

>>> from urllib3 import poolmanager

>>> connectBuilder = poolmanager.PoolManager()

>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')

>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>

>>> soup = BeautifulSoup(content)

>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''

>>> content.data
a stream of data follows...

As shown, urlopen() returns an HTTP response, which is captured in the variable content. It makes sense that I can read the status of the response, but after the response is passed into Beautiful Soup, the web data doesn't get converted into a usable Beautiful Soup object (the variable soup). I've tried to read a few tags and the text: soup.title and soup.p return nothing, and get_text() returns an empty string, which is strange.

Yet when I access the web data via content.data, the data does show up. That doesn't help me, though, since I can't get Beautiful Soup to parse it. What am I doing wrong? Thanks.

user3885774 asked Jul 31 '14 19:07


2 Answers

If you just want to scrape the page, requests will get the content you need:

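from bs4 import BeautifulSoup
import requests

# r.content holds the raw response body as bytes
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(r.content, 'html.parser')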

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'
Padraic Cunningham answered Nov 09 '22 10:11


urllib3 returns a response object whose .data attribute holds the preloaded body payload.

Per the top quickstart usage example here, I would do something like this:

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

The rest should work as intended.

--

A little about what went wrong in your original code:

You passed the entire response object rather than the body payload. Normally this would be fine, because the response object is file-like. In this case, however, urllib3 has already read the entire response body for you (content preloading), so there is nothing left to .read(). It's like passing a file pointer that has already been read to the end. .data, on the other hand, gives you the already-read data.
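To make the "already-read file pointer" analogy concrete, here is a minimal sketch that uses only the standard library. It does not touch urllib3 at all: io.BytesIO is just a stand-in for the response object, and the names body, stream, preloaded, and leftover are made up for illustration.

```python
import io

# Stand-in for an HTTP body. io.BytesIO is a stdlib file-like object,
# used here only as an analogy for the urllib3 response.
body = b"<html><head><title>Demo</title></head></html>"
stream = io.BytesIO(body)

# Simulate preloading: the whole stream is read up front and the bytes
# are kept around (analogous to what .data holds).
preloaded = stream.read()

# A later consumer that calls .read() on the same object gets nothing,
# because the read position is already at the end of the stream.
leftover = stream.read()

print(len(preloaded))  # length of the full body
print(leftover)        # b''
```

A parser that is handed the exhausted stream sees the same empty bytes as leftover, which is consistent with the empty soup in the question; handing it preloaded instead gives it the full document.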

If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

Now it should work as you expected.

I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).

--

A quick note on .urlopen vs .request:

.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters, but in general .request does that extra work for you, so it's more convenient.
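As a rough illustration of what "encoding the parameters yourself" means for a GET request, the sketch below builds a query string with the standard library. build_get_url is a hypothetical helper, not part of urllib3, and the example.com URL and field names are placeholders; it only approximates the kind of work .request is described as doing for you.

```python
from urllib.parse import urlencode


def build_get_url(base_url, fields=None):
    """Hypothetical helper: append URL-encoded fields as a query string."""
    if not fields:
        return base_url
    # urlencode percent-encodes values and joins them with '&';
    # spaces become '+'.
    return base_url + "?" + urlencode(fields)


url = build_get_url("http://example.com/search", {"q": "beautiful soup", "page": 2})
print(url)  # http://example.com/search?q=beautiful+soup&page=2
```

With .urlopen you would pass a URL that has already been encoded like this, whereas with .request you would typically hand over the raw parameters and let the library encode them.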

If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!

shazow answered Nov 09 '22 09:11