
Store HTML in Python

I'm using both XPath and BeautifulSoup to scrape a webpage. XPath needs a tree as input and BeautifulSoup needs a soup as input. Here's the code to get the tree and the soup:

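import requests
from lxml import html
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4 (the bs4 package)

# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup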

Both of these functions call requests.get(url). That response is what I want to store ahead of time, so the page is only downloaded once. Here's the code in Python:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

And then I got an error like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

Here's the code that stores the content and then reads it back, which gives a different error:

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How can I resolve this?

asked Jun 12 '26 09:06 by f4fc2791e4473eb2ba41b5ddb445b2



1 Answer

infile = open("html", "rb")  # this is a file object, not a string

html.fromstring() expects the document's contents (a string or bytes), so you need to read the file with read() rather than just open it:

with open("html", "rb") as infile:
    data = infile.read()  # bytes holding the page contents
tree = html.fromstring(data)
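
The first error in the question has the same kind of cause: f.write(r) passes the Response object itself, whereas a file opened in 'wb' mode wants bytes, which is what r.content holds. Below is a minimal sketch of the whole round trip (save once, then build both the tree and the soup from the saved bytes). It assumes Python 3 and that lxml and bs4 are installed; the helper names save_html, load_html and parse_both are made up for illustration, and a stand-in byte string is used in place of a live requests.get(url).content.

```python
import tempfile
from pathlib import Path


def save_html(data, path):
    """Write the raw page bytes (e.g. r.content) to disk."""
    Path(path).write_bytes(data)


def load_html(path):
    """Read the saved page back as bytes, not as a file object."""
    return Path(path).read_bytes()


def parse_both(data):
    """Build an lxml tree and a BeautifulSoup soup from the same bytes.

    The third-party imports are kept local so the two helpers above
    work without them. Assumes lxml and bs4 are installed.
    """
    from bs4 import BeautifulSoup
    from lxml import html

    tree = html.fromstring(data)
    soup = BeautifulSoup(data, "html.parser")
    return tree, soup


# Stand-in bytes; in the question this would be requests.get(url).content
page = b"<html><body><p>hello</p></body></html>"

# Hypothetical cache location in the system temp directory
cache_path = Path(tempfile.gettempdir()) / "cached_page.html"
save_html(page, cache_path)
restored = load_html(cache_path)
```

After this, tree, soup = parse_both(restored) gives both objects without requesting the page a second time.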

answered Jun 13 '26 21:06 by Md. Mohsin
