I was wondering whether I could write a Haskell program to check for updates of some novels on demand; the website I am using as an example is this. I ran into a problem when displaying its contents (on a Mac running El Capitan). The code is simple:
import Network.HTTP
openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest
display :: String -> IO ()
display = (>>= putStrLn) . openURL
Then, when I run display "http://www.piaotian.net/html/7/7430/"
in GHCi, some strange characters appear; the first few lines look like this:
<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄѧÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏÉ°æȨÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">
I also tried to download as a file as follows:
import Network.HTTP
openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest
downloading :: String -> IO ()
downloading = (>>= writeFile fileName) . openURL
But after downloading the file, it is like in the photo:
If I download the page with Python (using urllib, for example), the characters are displayed normally. Also, if I write a Chinese HTML file myself and parse it, there seems to be no problem. So the problem seems to lie with the website. However, I don't see any difference between the characters on the site and those I write.
Any help in understanding the reason behind this is much appreciated.
P.S.
The python code is as follows:
import urllib
theFic = file_path
urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)
And the file is all fine and good.
I'm pretty sure that if you use Network.HTTP with the String type, it converts bytes to characters using your system encoding, which is, in general, wrong.
This is only one of several reasons I don't like Network.HTTP.
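To see how undecoded bytes lead to the output in the question, here is a minimal sketch using only base. It assumes the page really is GBK, as its meta tag says, and it maps each byte straight to the Char with the same code point; that mapping is an illustration of the symptom, not a claim about Network.HTTP's internals. The names titleBytes and bytesAsChars are made up for this example.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- The two bytes 0xD7 0xDF are the GBK encoding of the character 走,
-- the first character of the page title.
titleBytes :: [Word8]
titleBytes = [0xD7, 0xDF]

-- Turn every byte into the Char with the same code point,
-- i.e. skip the GBK decoding step entirely.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

main :: IO ()
main = putStrLn (bytesAsChars titleBytes)
```

The resulting string consists of U+00D7 and U+00DF, which are the first two characters of the garbled title shown in the question.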
Your options:
Use the ByteString interface. It's more awkward for some reason. It'll also require you to decode the bytes to characters manually. Most sites give you an encoding in the response headers, but sometimes they lie. It's a giant mess, really.
Use a different HTTP fetching library. I don't think any of them remove the messiness of dealing with lying encodings, but at least they don't make it awkward to avoid misusing the system encoding. I'd look into wreq or http-client instead.
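As a rough sketch of the first option, the request can be built with mkRequest so that the response body comes back as a strict ByteString instead of a String. The names openURLBytes and downloading are illustrative, and the URL is parsed with parseURI from the network-uri package, which is an extra dependency assumed here.

```haskell
import qualified Data.ByteString as BS
import Network.HTTP (RequestMethod (GET), getResponseBody, mkRequest, simpleHTTP)
import Network.URI (parseURI)

-- Fetch the raw response body; no bytes-to-characters conversion happens.
openURLBytes :: String -> IO BS.ByteString
openURLBytes url =
  case parseURI url of
    Nothing  -> ioError (userError ("Invalid URL: " ++ url))
    Just uri -> simpleHTTP (mkRequest GET uri) >>= getResponseBody

-- Save the bytes exactly as the server sent them.
downloading :: FilePath -> String -> IO ()
downloading path url = openURLBytes url >>= BS.writeFile path
```

A file written this way still contains GBK bytes, so it should open correctly in any viewer that honours the page's charset declaration. Printing it in a terminal still needs an explicit GBK-to-Unicode decoding step.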
Here is an updated answer which uses the encoding package to convert the GBK-encoded contents to Unicode.
#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens --package encoding --package binary
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E
import Data.Binary.Get

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      foo = runGet (E.decode E.GB18030) body
  putStrLn foo