
Download PDF file from Wikipedia

Wikipedia provides a link (on the left side, under Print/export) on every article to download the article as a PDF. I wrote a small Haskell script which first fetches the Wikipedia page and extracts the rendering link. But when I give that rendering URL as input, I get empty tags, even though the same URL in a browser offers the download link.

Could someone please tell me how to solve this problem? Formatted code on ideone.

import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

parseHelp :: Tag String -> Maybe String
parseHelp (TagOpen _ y) =
    if any (\(_, b) -> b == "Download a PDF version of this wiki page") y
        then Just $ "http://en.wikipedia.org" ++ snd (head y)
        else Nothing

parse :: [Tag String] -> Maybe String
parse [] = Nothing
parse (x : xs)
    | isTagOpen x = case parseHelp x of
        Just s  -> Just s
        Nothing -> parse xs
    | otherwise = parse xs

main :: IO ()
main = do
    x <- getLine
    tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest x)  -- open url
    let lst = head . sections (~== "<div class=portal id=p-coll-print_export>") $ tags_1
        url = fromJust (parse lst)  -- rendering url
    putStrLn url
    tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest url)
    print tags_2
keep_learning asked Sep 08 '11



1 Answer

If you try requesting the URL through some external tool like wget, you will see that Wikipedia does not serve up the result page directly. It actually returns a 302 Moved Temporarily redirect.

When you enter this URL in a browser, it works, because the browser follows the redirect automatically. simpleHTTP, however, will not. simpleHTTP is, as the name suggests, rather simple. It does not handle things like cookies, SSL or redirects.
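You can see this for yourself by inspecting the status triple that Network.HTTP's getResponseCode returns (a ResponseCode is a digit triple, e.g. (3,0,2) for 302). A minimal sketch; the isRedirect helper below is my own, not part of the HTTP package:

```haskell
-- With Network.HTTP in scope you would obtain the triple like this:
--   code <- getResponseCode =<< simpleHTTP (getRequest url)
-- and then classify it yourself, since simpleHTTP does not act on it:

-- True for any 3xx status, i.e. a redirect simpleHTTP won't follow.
isRedirect :: (Int, Int, Int) -> Bool
isRedirect (3, _, _) = True
isRedirect _         = False

main :: IO ()
main = do
    print (isRedirect (3, 0, 2))  -- the case the rendering URL hits
    print (isRedirect (2, 0, 0))  -- a page served directly
```

Running this prints True then False; in the question's program, the second simpleHTTP call hits the first case, so the body it parses is the near-empty redirect page rather than the PDF download page.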

You'll want to use the Network.Browser module instead. It offers much more control over how the requests are done. In particular, the setAllowRedirects function will make it automatically follow redirects.

Here's a quick and dirty function for downloading a URL into a String with support for redirects:

import Network.HTTP      -- for getRequest and rspBody
import Network.Browser

grabUrl :: String -> IO String
grabUrl url = fmap (rspBody . snd) . browse $ do
    -- Disable logging output
    setErrHandler $ const (return ())
    setOutHandler $ const (return ())

    setAllowRedirects True
    request $ getRequest url
hammar answered Nov 15 '22