
Issue with scraping site with foreign characters

I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of those schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on another site with foreign universities in the exact same way, and everything works fine. But for some reason, the current scraper won't work with foreign characters (and as far as parsing foreign characters, the two scrapers are exactly the same).

Here's what I'm doing to try & make things work:

  1. Declare encoding on the very first line of the file:

    # -*- coding: utf-8 -*-
    
  2. Importing and using `smart_unicode` from the Django framework:

    from django.utils.encoding import smart_unicode

    school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8',
                                strings_only=False, errors='strict').encode('utf-8')
    
  3. Using the `encode` function, as seen above, chained onto the `smart_unicode` call.

I can't think of what else I could be doing wrong. Before dealing with these scrapers, I really didn't understand much about character encodings, so this has been a bit of an eye-opening experience. I've tried reading the following, but still can't overcome the problem:

    • http://farmdev.com/talks/unicode/

    • http://www.joelonsoftware.com/articles/Unicode.html

I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings support different ranges of characters (e.g. ASCII only covers English, while UTF-8 seems to cover everything). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy. Please help!
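To illustrate the usual failure mode here: decoding bytes with the wrong codec often produces garbled text rather than an error, so a wrong encoding guess can fail silently. A minimal, self-contained sketch (the university name is hard-coded to stand in for scraped bytes):

```python
# UTF-8 bytes as a server might actually send them for "Universität Zürich":
raw = b"Universit\xc3\xa4t Z\xc3\xbcrich"

# Decoding with the right codec round-trips cleanly:
name = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 raises no error at all -- it just
# yields mojibake ("UniversitÃ¤t ZÃ¼rich"), each UTF-8 byte mapped to a
# separate Latin-1 character:
mojibake = raw.decode("iso-8859-1")
```

So the decode step itself is rarely where an exception appears; the real question is whether the codec you pass matches what the site actually served.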

asked Jun 05 '12 by user642547



2 Answers

When extracting information from a web page, you need to determine its character encoding, similarly to how browsers do: analyze the HTTP headers, parse the HTML to find meta tags, and possibly guess based on the actual data (e.g. the presence of something that looks like a BOM for some encoding). Hopefully you can find a library routine that does this for you.

In any case, you should not expect all web sites to be UTF-8-encoded. ISO-8859-1 is still in widespread use, and in general reading ISO-8859-1 data as if it were UTF-8 results in a big mess (for any non-ASCII characters).
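That lookup order can be sketched roughly as follows. Note this `guess_encoding` helper is a hypothetical illustration, not a complete implementation: real code should also check for BOMs and may want a charset-detection library as a last resort.

```python
import re

def guess_encoding(content_type_header, html_bytes, default="iso-8859-1"):
    """Rough sketch of the order a browser checks (illustrative only)."""
    # 1. charset parameter in the HTTP Content-Type header
    if content_type_header:
        m = re.search(r"charset=([\w-]+)", content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. <meta charset=...> tag near the top of the document
    head = html_bytes[:1024].decode("ascii", "replace")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3. fall back to a default (HTML 4's default was ISO-8859-1)
    return default
```

For example, `guess_encoding("text/html; charset=utf-8", b"")` yields `"utf-8"`, while a bare `"text/html"` header with no meta tag falls through to the ISO-8859-1 default.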

answered Oct 02 '22 by Jukka K. Korpela


If you are using the requests library it will automatically decode the content based on HTTP headers. Getting the HTML content of a page is really easy:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
answered Oct 02 '22 by schlamar