Wikipedia provides a link (in the left sidebar, under Print/export) on every article to download the article as a PDF. I wrote a small Haskell script which first fetches the Wikipedia page and extracts the rendering link. But when I then request that rendering URL, I get back empty tags, while the same URL in a browser shows the download link.
Could someone please tell me how to solve this problem? Formatted code is on ideone.
import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

-- Pick out the PDF link from an opening tag whose title advertises the download.
parseHelp :: Tag String -> Maybe String
parseHelp (TagOpen _ y) =
    if any (\(_, b) -> b == "Download a PDF version of this wiki page") y
        then Just $ "http://en.wikipedia.org" ++ snd (head y)
        else Nothing
parseHelp _ = Nothing

-- Scan the tag stream for the first opening tag that parseHelp accepts.
parse :: [Tag String] -> Maybe String
parse [] = Nothing
parse (x:xs)
    | isTagOpen x = case parseHelp x of
        Just s  -> Just s
        Nothing -> parse xs
    | otherwise = parse xs

main :: IO ()
main = do
    x <- getLine
    tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest x)  -- open the article URL
    let lst = head . sections (~== "<div class=portal id=p-coll-print_export>") $ tags_1
        url = fromJust . parse $ lst  -- the rendering URL
    putStrLn url
    tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest url)
    print tags_2
If you try requesting the URL through some external tool like wget, you will see that Wikipedia does not serve up the result page directly. It actually returns a 302 Moved Temporarily redirect.
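You can observe the same thing from Haskell. A minimal sketch, assuming only the Network.HTTP API the question already uses (checkRedirect is a hypothetical helper name, not part of any library):

import Network.HTTP

-- Print the status code and Location header that simpleHTTP actually
-- receives for the rendering URL, without following the redirect.
checkRedirect :: String -> IO ()
checkRedirect url = do
    rsp <- simpleHTTP (getRequest url) >>= either (fail . show) return
    print (rspCode rsp)                                -- (3,0,2) for a 302
    print (lookupHeader HdrLocation (rspHeaders rsp))  -- the redirect target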
When entering this URL in a browser, it will be fine, as the browser will follow the redirect automatically. simpleHTTP, however, will not. simpleHTTP is, as the name suggests, rather simple: it does not handle things like cookies, SSL or redirects.
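For illustration, you could even follow the redirect by hand with simpleHTTP, re-requesting whatever the Location header points at. A rough sketch (followOnce is a hypothetical helper; it assumes the Location value is an absolute URL and handles only one level of redirection):

-- Follow at most one redirect manually.
followOnce :: String -> IO String
followOnce url = do
    rsp <- simpleHTTP (getRequest url) >>= either (fail . show) return
    case (rspCode rsp, lookupHeader HdrLocation (rspHeaders rsp)) of
        ((3, _, _), Just loc) -> getResponseBody =<< simpleHTTP (getRequest loc)
        _                     -> return (rspBody rsp)

The Network.Browser approach below avoids all of this bookkeeping.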
You'll want to use the Network.Browser module instead. It offers much more control over how requests are made. In particular, the setAllowRedirects function will make it follow redirects automatically.
Here's a quick and dirty function for downloading a URL into a String, with support for redirects:
import Network.Browser
import Network.HTTP (getRequest, rspBody)  -- already imported in the question's code

grabUrl :: String -> IO String
grabUrl url = fmap (rspBody . snd) . browse $ do
    -- Disable logging output
    setErrHandler $ const (return ())
    setOutHandler $ const (return ())
    setAllowRedirects True
    request $ getRequest url
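To wire this into the question's main, the second fetch can go through grabUrl instead of simpleHTTP. A sketch against the code above:

    putStrLn url
    tags_2 <- fmap parseTags $ grabUrl url  -- now follows the 302 automatically
    print tags_2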