
Issue with scraping site with foreign characters

I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of those schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on another site with foreign universities in the exact same way, and everything works fine. But for some reason, the current scraper won't work with foreign characters (and as far as parsing foreign characters, the two scrapers are exactly the same).

Here's what I'm doing to try & make things work:

  1. Declare encoding on the very first line of the file:

    # -*- coding: utf-8 -*-
    
  2. Importing and using `smart_unicode` from the Django framework:

    from django.utils.encoding import smart_unicode

    school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8',
                                strings_only=False, errors='strict').encode('utf-8')
    
  3. Using the `encode` function, as seen above, chained onto the `smart_unicode` call.

I can't think of what else I could be doing wrong. Before dealing with these scrapers, I really didn't understand much about character encodings, so this has been a bit of an eye-opening experience. I've tried reading the following, but still can't overcome the problem:

    • http://farmdev.com/talks/unicode/

    • http://www.joelonsoftware.com/articles/Unicode.html

I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings support different ranges of characters (e.g. ASCII only covers English, while UTF-8 seems to cover everything). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy. Please help!
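To illustrate the usual failure mode here: decoding bytes with the wrong codec often produces garbled text rather than an error, so a wrong encoding guess can fail silently. A minimal, self-contained sketch (the university name is hard-coded to stand in for scraped bytes):

```python
# UTF-8 bytes as a server might actually send them for "Universität Zürich":
raw = b"Universit\xc3\xa4t Z\xc3\xbcrich"

# Decoding with the right codec round-trips cleanly:
name = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 raises no error at all -- it just
# yields mojibake ("UniversitÃ¤t ZÃ¼rich"), each UTF-8 byte mapped to a
# separate Latin-1 character:
mojibake = raw.decode("iso-8859-1")
```

So the decode step itself is rarely where an exception appears; the real question is whether the codec you pass matches what the site actually served.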

asked Jun 05 '12 by user642547



2 Answers

When extracting information from a web page, you need to determine its character encoding, similarly to how browsers do: analyze the HTTP headers, parse the HTML to find meta tags, and possibly guess based on the actual data (e.g. the presence of something that looks like a BOM for some encoding). Hopefully you can find a library routine that does this for you.

In any case, you should not expect all web sites to be UTF-8-encoded. ISO-8859-1 is still in widespread use, and in general reading ISO-8859-1 data as if it were UTF-8 results in a big mess (for any non-ASCII characters).
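That lookup order can be sketched roughly as follows. Note this `guess_encoding` helper is a hypothetical illustration, not a complete implementation: real code should also check for BOMs and may want a charset-detection library as a last resort.

```python
import re

def guess_encoding(content_type_header, html_bytes, default="iso-8859-1"):
    """Rough sketch of the order a browser checks (illustrative only)."""
    # 1. charset parameter in the HTTP Content-Type header
    if content_type_header:
        m = re.search(r"charset=([\w-]+)", content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. <meta charset=...> tag near the top of the document
    head = html_bytes[:1024].decode("ascii", "replace")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3. fall back to a default (HTML 4's default was ISO-8859-1)
    return default
```

For example, `guess_encoding("text/html; charset=utf-8", b"")` yields `"utf-8"`, while a bare `"text/html"` header with no meta tag falls through to the ISO-8859-1 default.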

answered Oct 02 '22 by Jukka K. Korpela


If you are using the requests library it will automatically decode the content based on HTTP headers. Getting the HTML content of a page is really easy:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
answered Oct 02 '22 by schlamar