store html in python

Question

I'm using both XPath (via lxml) and BeautifulSoup to scrape a webpage. XPath needs a tree as input and BeautifulSoup needs a soup. Here is the code to get the tree and the soup:

def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup

Both of these functions call requests.get(url), so the response is what I want to store ahead of time. Here is my attempt in Python:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

That gives this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

Here is the code that stores the response content instead. Writing the file works, but parsing the saved file raises an error:

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How could I resolve this?

Md. Mohsin · Accepted Answer

infile = open("html", "rb")  # this is a file object, not a string

html.fromstring expects the page contents, not an open file object, so read the file with read() first:

with open("html", "rb") as infile:  # the file is closed automatically
    content = infile.read()         # bytes read from the saved page
tree = html.fromstring(content)
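
Putting both fixes together, here is a minimal sketch of the whole workflow: download the page once, write r.content (bytes) rather than the Response object, and later build both the tree and the soup from the saved file. The helper names (save_html, load_html, fetch_and_save) and the file name page.html are made up for illustration, and the sketch assumes the bs4 package for BeautifulSoup with the built-in "html.parser"; adjust if you use something else.

```python
import os
import tempfile


def save_html(content, path):
    """Write the raw response body (bytes, e.g. r.content) to disk."""
    with open(path, "wb") as f:
        f.write(content)


def load_html(path):
    """Read the saved page back as bytes."""
    with open(path, "rb") as f:
        return f.read()


def fetch_and_save(url, path):
    """Download a page once and store its body (needs the requests package)."""
    import requests  # imported here so the file helpers work without it
    r = requests.get(url)
    save_html(r.content, path)  # write r.content, not the Response object r


def get_tree(path):
    """Build an lxml tree from the saved file (needs lxml)."""
    from lxml import html
    return html.fromstring(load_html(path))


def get_soup(path):
    """Build a soup from the saved file (assumes the bs4 package)."""
    from bs4 import BeautifulSoup
    return BeautifulSoup(load_html(path), "html.parser")


if __name__ == "__main__":
    # Round trip with a stand-in page, so no network access is needed.
    path = os.path.join(tempfile.mkdtemp(), "page.html")
    save_html(b"<html><body><p>hello</p></body></html>", path)
    print(load_html(path))
```

With a real page you would call fetch_and_save(url, "page.html") once, then get_tree("page.html") and get_soup("page.html") as often as needed without hitting the site again. If you only need the lxml side, lxml.html.parse also accepts a filename or file object directly and returns an ElementTree, from which getroot() gives the root element.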

Tags:

python

html

beautifulsoup

lxml

lxml.html
