lxml error "IOError: Error reading file" when parsing facebook mobile in a python scraper script

I use a modified version of the script from the post "Logging into facebook with python":

#!/usr/bin/python2 -u
# -*- coding: utf8 -*-

facebook_email = "[email protected]"
facebook_passwd = "YOUR_PASSWORD"


import cookielib, urllib2, urllib, time, sys
from lxml import etree

jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)       
opener = urllib2.build_opener(cookie)

headers = {
    "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
    "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "utf-8",
    "Content-type": "application/x-www-form-urlencoded",
    "Host": "m.facebook.com"
}

try:
    params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'})
    req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F', params, headers)
    res = opener.open(req)
    html = res.read()

except urllib2.HTTPError, e:
    print e.msg
except urllib2.URLError, e:
    print e.reason[1]

def fetch(url):
    req = urllib2.Request(url,None,headers)
    res = opener.open(req)
    return res.read()

body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"), errors='ignore')
tree = etree.parse(body)
r = tree.xpath('/see_prev')
print r.text

When I execute the code, I get the following error:

$ ./facebook_fetch_coms.py
Traceback (most recent call last):
  File "./facebook_fetch_coms_classic_test.py", line 42, in <module>
    tree = etree.parse(body)
  File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691)
IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life."

The goal is first to get the link with id=see_prev using lxml, then to use a while loop to open all the comments, and finally to save all the messages to a file. Any help would be much appreciated!

Edit: I use Python 2.7.2 on Arch Linux x86_64 with lxml 2.3.3.

asked Mar 07 '12 00:03 by Gilles Quenot


1 Answer

This is your problem:

tree = etree.parse(body)

The documentation says that "source is a filename or file object containing XML data." You have provided a string, so lxml is taking the text of your HTTP response body as the name of the file you wish to open. No such file exists, so you get an IOError.

The error message you get even says "Error reading file" and then gives your XML string as the name of the file it's trying to read, which is a mighty big hint about what's going on.
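The failure can be reproduced without any network access. The following is a minimal sketch, written for Python 3 (where IOError is an alias of OSError) rather than the Python 2 of the question; the xml_text value is a made-up stand-in for the response body.

```python
from lxml import etree

# Hypothetical stand-in for the HTTP response body held in a string.
xml_text = '<root><child/></root>'

try:
    # etree.parse() treats a plain string as a filename (or URL),
    # so it tries to open a file with this odd name.
    etree.parse(xml_text)
    outcome = "parsed"
except IOError as exc:
    # No such file exists, so lxml reports an I/O error.
    outcome = "IOError"
    print(exc)
```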

You probably want etree.XML() (or the equivalent etree.fromstring()), which takes its input from a string and returns the root element. Alternatively, you could call tree = etree.parse(res) to read directly from the HTTP response: the result of opener.open() is a file-like object, and etree.parse() should be perfectly happy to consume it, provided res.read() has not already been called on it.
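Both options can be sketched as follows, under a few assumptions: the code targets Python 3, the body value is an invented stand-in for the bytes returned by res.read(), and the see_prev link and its href are made up for illustration. The body is passed as bytes rather than converted to unicode first, because lxml raises a ValueError for unicode strings that carry an XML encoding declaration, as the page in the question does.

```python
import io
from lxml import etree

# Invented stand-in for the bytes returned by res.read(); nothing is fetched.
body = (b'<?xml version="1.0" encoding="utf-8"?>\n'
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b'<head><title>Facebook</title></head>'
        b'<body><a id="see_prev" href="/comments?page=2">See previous</a></body>'
        b'</html>')

# Option 1: parse from a byte string; this returns the root element.
root = etree.fromstring(body)

# Option 2: parse from a file-like object; this returns an ElementTree.
# io.BytesIO plays the role of the object returned by opener.open().
tree = etree.parse(io.BytesIO(body))

# The page declares the XHTML namespace, so the XPath needs a prefix
# mapping; "x" is an arbitrary prefix chosen for this sketch.
ns = {"x": "http://www.w3.org/1999/xhtml"}
links = root.xpath('//x:a[@id="see_prev"]', namespaces=ns)

# xpath() returns a list, so index into it before reading attributes.
if links:
    print(links[0].get("href"))
```

Note that xpath() returns a list of elements, so the result has to be indexed (or looped over) before accessing .text or attributes.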

answered Oct 22 '22 06:10 by kindall