
Why can Haskell not handle characters from a specific website?

I was wondering whether I could write a Haskell program to check for updates of some novels on demand; the website I am using as an example is this. I ran into a problem when displaying its contents (on a Mac running El Capitan). The code is simple:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

display :: String -> IO ()
display = (>>= putStrLn) . openURL

Then, when I run display "http://www.piaotian.net/html/7/7430/" in GHCi, some strange characters appear; the first lines look like this:

<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄѧÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏÉ°æȨÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">

I also tried to download as a file as follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

downloading :: String -> IO ()
downloading = (>>= writeFile fileName) . openURL

But after downloading, the file looks like the one in this screenshot: (image)

If I download the page with Python (using urllib, for example), the characters are displayed normally. Also, if I write a Chinese HTML file myself and parse it, there seems to be no problem. So the problem seems to lie with the website. However, I don't see any difference between the characters on the site and those I write.

Any help explaining the reason behind this is much appreciated.

P.S.
The python code is as follows:

import urllib

theFic = file_path  # path of the output file; must be defined before use

urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)

And the file is all fine and good.

awllower asked Aug 01 '16 14:08




2 Answers

I'm pretty sure that if you use Network.HTTP with the String type, it converts bytes to characters without consulting the encoding the page declares (the garbled title in the question looks like one character per GBK byte), which is, in general, wrong.

This is only one of several reasons I don't like Network.HTTP.
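The garbled title in the question is consistent with every response byte being turned into one character with the same numeric value: the first four characters, ×ß½ø, are the code points 0xD7, 0xDF, 0xBD and 0xF8, which look like raw GBK bytes. The sketch below reproduces that mapping with nothing but the bytestring package. The four bytes are an assumption inferred from the output shown in the question, not captured from the site, and the helper names are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Char (ord)
import Numeric (showHex)

-- First four bytes of the page title as they would arrive on the wire.
-- Assumption: inferred from the garbled output, not taken from the site.
titleBytes :: BS.ByteString
titleBytes = BS.pack [0xD7, 0xDF, 0xBD, 0xF8]

-- One Char per byte: the kind of String that prints as the garbled title.
bytesAsChars :: BS.ByteString -> String
bytesAsChars = BS8.unpack

-- The mapping loses nothing, so the original bytes can be recovered.
-- Only valid while every Char is below 256; BS8.pack truncates larger ones.
charsAsBytes :: String -> BS.ByteString
charsAsBytes = BS8.pack

main :: IO ()
main = do
  let garbled = bytesAsChars titleBytes
  -- Print code points instead of the characters themselves, so the result
  -- does not depend on the terminal's encoding.
  putStrLn (unwords (map (\c -> showHex (ord c) "") garbled))
  print (charsAsBytes garbled == titleBytes)
```

If that is what happened, the String is not damaged beyond repair: packing it back into a ByteString yields the original bytes, which can then be handed to a real GBK decoder.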

Your options:

  1. Use the ByteString interface. It's more awkward for some reason. It'll also require you to decode the bytes to characters manually. Most sites give you an encoding in the response headers, but sometimes they lie. It's a giant mess, really.

  2. Use a different HTTP fetching library. I don't think any of them remove the messiness of dealing with lying encodings, but at least they don't push you toward decoding with the wrong encoding. I'd look into wreq or http-client instead.
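Whichever option is chosen, the declared encoding has to be read from somewhere. As a minimal sketch, the hypothetical helper below (charsetFromContentType is not part of any of the libraries mentioned) pulls the charset parameter out of a Content-Type value such as the one in the page's meta tag. It uses only base, ignores variants with spaces around the equals sign, and, given the caveat about lying headers, its result is best treated as a hint.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd, isPrefixOf)
import Data.Maybe (listToMaybe)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Drop leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Hypothetical helper: return the lower-cased charset parameter of a
-- Content-Type value, or Nothing when no such parameter is present.
charsetFromContentType :: String -> Maybe String
charsetFromContentType value =
  listToMaybe
    [ filter (/= '"') (drop (length key) param)
    | param <- map (map toLower . trim) (drop 1 (splitOn ';' value))
    , key `isPrefixOf` param
    ]
  where
    key = "charset="

main :: IO ()
main =
  mapM_ (print . charsetFromContentType)
    [ "text/html; charset=gbk"          -- the value from the page's meta tag
    , "text/html"                       -- no charset declared
    , "text/html; Charset=\"UTF-8\""    -- quoted, mixed case
    ]
```

The returned name can then be used to pick a decoder, falling back to a default when the result is Nothing.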

Carl answered Sep 20 '22 13:09



Here is an updated answer which uses the encoding package to convert the GBK-encoded contents to Unicode.

#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens --package encoding --package binary
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E
import Data.Binary.Get

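main :: IO ()
main = do
  -- Fetch the page; wreq hands back the body as raw bytes.
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      -- GB18030 extends GBK, so its decoder can read this page.
      decoded = runGet (E.decode E.GB18030) body
  putStrLn decoded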
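If the goal is only to save the page, as in the question's downloading function, decoding can be skipped entirely: Python's urlretrieve writes the response bytes to disk unchanged, and the same can be done in Haskell by writing through a binary-mode handle. The sketch below assumes the String returned by the question's openURL holds one Char per byte, as the garbled output suggests. saveRaw and the file name are hypothetical, and a short literal stands in for the response body.

```haskell
import qualified Data.ByteString as BS
import System.IO (IOMode (WriteMode), hPutStr, withBinaryFile)

-- Write a String in which every Char stands for one byte.
-- A binary-mode handle writes each Char as a single byte, whereas plain
-- writeFile would re-encode the Chars using the locale's encoding.
saveRaw :: FilePath -> String -> IO ()
saveRaw path body = withBinaryFile path WriteMode (\h -> hPutStr h body)

main :: IO ()
main = do
  let path = "page-raw.html"        -- hypothetical output file
      body = "\215\223\189\248"     -- stand-in for the response body
  saveRaw path body
  -- Read the file back as bytes to check that nothing was re-encoded.
  bytes <- BS.readFile path
  print (BS.unpack bytes)
```

A file written this way is still GBK-encoded, so it displays correctly only in a viewer that honours the page's charset declaration.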
ErikR answered Sep 20 '22 13:09
