urllib2 read to Unicode

Question

I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.

I have tried something like:

    import urllib2

    req = urllib2.urlopen('http://lenta.ru')
    content = req.read()

The content is a byte stream, so I can't search it for a Unicode string.

I need a way, after calling urlopen and read, to decode the content using the charset from the headers and then encode it as UTF-8.

Alex Martelli · Accepted Answer

After the operations you performed, you'll see:

    >>> req.headers['content-type']
    'text/html; charset=windows-1251'

and so:

    >>> encoding = req.headers['content-type'].split('charset=')[-1]
    >>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters) -- so for example to display a part of it, if your terminal is UTF-8:

    >>> print ucontent[76:110].encode('utf-8')
    <title>Lenta.ru: Главное: </title>

and you can search, etc, etc.
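The `split('charset=')` one-liner above assumes the header always carries a charset; for a bare `text/html` it would return the string `'text/html'` as the encoding name. A minimal Python 3 sketch of a more defensive parse follows. The helper name `charset_from_header` and the `utf-8` default are assumptions for illustration, and the sample bytes stand in for a real response body so the example runs without a network connection.

```python
def charset_from_header(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper, not part of urllib; the default is an assumption.
    """
    # Skip the media type itself and look only at the ';'-separated parameters.
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip("\"'")
    return default


# Stand-in for the bytes returned by read() on a windows-1251 page.
raw = "<title>Lenta.ru: Главное: </title>".encode("windows-1251")

encoding = charset_from_header("text/html; charset=windows-1251")
text = raw.decode(encoding)          # a str (Unicode) in Python 3
utf8_bytes = text.encode("utf-8")    # re-encoded as UTF-8 for storage
```

Here `text` is what the question calls a Unicode string, and `utf8_bytes` is the UTF-8 form to store.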

Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker), but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter, which is completely unrelated to the original question. Once a Unicode string IS correctly input (I'm doing it by code points -- goofy but not tricky;-), search is absolutely a no-brainer, and thus hopefully the original question has been thoroughly answered. Again assuming a UTF-8 terminal:

    >>> x = u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
    >>> print x.encode('utf-8')
    Главное
    >>> x in ucontent
    True
    >>> ucontent.find(x)
    93
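For readers on Python 3, the same search can be sketched offline. The snippet below is illustrative only: the sample bytes are just the title fragment shown earlier, encoded as windows-1251, not the full page, so the offset differs from the 93 above.

```python
# Stand-in for a windows-1251 response body (only the title fragment).
raw = "<title>Lenta.ru: Главное: </title>".encode("windows-1251")

# The search term, entered by code points as in the session above.
needle = "\u0413\u043b\u0430\u0432\u043d\u043e\u0435"  # "Главное"

# Searching the undecoded bytes for the UTF-8 form of the term fails,
# because the page bytes use a different encoding.
found_in_bytes = needle.encode("utf-8") in raw

# After decoding with the right charset, an ordinary str search works.
text = raw.decode("windows-1251")
found_in_text = needle in text
position = text.find(needle)
```

With this fragment, `found_in_bytes` is `False`, `found_in_text` is `True`, and `position` is 17, which matches the session above: the fragment starts at offset 76 of the page, and 76 + 17 = 93.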

Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
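As a rough illustration of that caveat, a fallback could scan the start of the raw bytes for a charset declared in a meta tag. This is a simplified sketch, not the full encoding-sniffing algorithm browsers use; the function name `sniff_meta_charset`, the regular expression, and the 2048-byte limit are all assumptions made for the example.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form, where
# charset=... appears inside the content attribute. Deliberately loose.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(body, limit=2048):
    """Return a charset declared in a <meta> tag near the top of `body`, or None.

    Hypothetical helper: `body` is the undecoded bytes of the page.
    """
    match = _META_CHARSET.search(body[:limit])
    # In a bytes pattern, \w is ASCII-only, so decoding as ASCII is safe.
    return match.group(1).decode("ascii") if match else None


html5_style = b'<html><head><meta charset="windows-1251"></head>'
http_equiv_style = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">'
)
```

A caller would typically prefer the HTTP header when it names a charset and use a sniffed value like this only when the header is silent.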

jfs · Answer

To parse the Content-Type HTTP header, you could use the cgi.parse_header function:

    import cgi
    import urllib2

    r = urllib2.urlopen('http://lenta.ru')
    _, params = cgi.parse_header(r.headers.get('Content-Type', ''))
    encoding = params.get('charset', 'utf-8')
    unicode_text = r.read().decode(encoding)
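The `cgi` module was deprecated in Python 3.11 and removed in 3.13, so on recent interpreters the same header parsing can be sketched with the standard `email` package instead. The wrapper name `parse_charset` and its `utf-8` default are assumptions for illustration; `Message.get_content_charset` is the standard-library call doing the work.

```python
from email.message import Message


def parse_charset(content_type, default="utf-8"):
    """Extract the charset parameter from a raw Content-Type header value.

    Hypothetical wrapper around email.message.Message; returns `default`
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and falls back to failobj.
    return msg.get_content_charset(default)


with_charset = parse_charset("text/html; charset=windows-1251")
without_charset = parse_charset("text/html")
```

This mirrors the `params.get('charset', 'utf-8')` fallback in the snippet above.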

Another way to get the charset:

    >>> import urllib2
    >>> r = urllib2.urlopen('http://lenta.ru')
    >>> r.headers.getparam('charset')
    'utf-8'

Or in Python 3:

    >>> import urllib.request
    >>> r = urllib.request.urlopen('http://lenta.ru')
    >>> r.headers.get_content_charset()
    'utf-8'
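Note that `get_content_charset()` returns `None` when the header names no charset, and a server can also announce an encoding Python does not know or that does not match the bytes. A minimal Python 3 sketch of a tolerant decode step follows; the function name `decode_body`, the `utf-8` default, and the choice to fall back with `errors="replace"` are assumptions, not something the answers above prescribe.

```python
def decode_body(body, header_charset=None, default="utf-8"):
    """Decode response bytes to str, tolerating a missing or bad charset.

    Hypothetical helper: `header_charset` is whatever the header lookup
    returned, possibly None.
    """
    encoding = header_charset or default
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes that do not fit it: fall back to the
        # default and substitute U+FFFD for anything undecodable.
        return body.decode(default, errors="replace")


sample = "Главное".encode("windows-1251")
decoded = decode_body(sample, "windows-1251")
as_utf8 = decoded.encode("utf-8")
```

With a live response `r` from `urllib.request.urlopen`, the call would look like `decode_body(r.read(), r.headers.get_content_charset())`.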

The character encoding can also be specified inside the HTML document itself, e.g., <meta charset="utf-8">.

Tags:

python

unicode

urllib2

Vitaly Babiy
