Wikipedia provides a link (in the left sidebar, under Print/export) on every article to download the article as a PDF. I wrote a small Haskell script which first fetches the Wikipedia page and extracts the rendering link. But when I then request that rendering URL, I get back empty tags, while the same URL in a browser shows the download link.
Could someone please tell me how to solve this problem? Formatted code is on ideone.
import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

-- Pick out the PDF link from an opening tag whose title advertises the download.
parseHelp :: Tag String -> Maybe String
parseHelp (TagOpen _ y) =
    if any (\(_, b) -> b == "Download a PDF version of this wiki page") y
        then Just $ "http://en.wikipedia.org" ++ snd (head y)
        else Nothing
parseHelp _ = Nothing

-- Scan the tag stream for the first opening tag that parseHelp accepts.
parse :: [Tag String] -> Maybe String
parse [] = Nothing
parse (x:xs)
    | isTagOpen x = case parseHelp x of
        Just s  -> Just s
        Nothing -> parse xs
    | otherwise = parse xs

main :: IO ()
main = do
    x <- getLine
    tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest x)  -- open the article URL
    let lst = head . sections (~== "<div class=portal id=p-coll-print_export>") $ tags_1
        url = fromJust . parse $ lst  -- the rendering URL
    putStrLn url
    tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest url)
    print tags_2
If you try requesting the URL through some external tool like wget, you will see that Wikipedia does not serve up the result page directly. It actually returns a 302 Moved Temporarily redirect.
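You can observe the same thing from Haskell. A minimal sketch, assuming only the Network.HTTP API the question already uses (checkRedirect is a hypothetical helper name, not part of any library):

import Network.HTTP

-- Print the status code and Location header that simpleHTTP actually
-- receives for the rendering URL, without following the redirect.
checkRedirect :: String -> IO ()
checkRedirect url = do
    rsp <- simpleHTTP (getRequest url) >>= either (fail . show) return
    print (rspCode rsp)                                -- (3,0,2) for a 302
    print (lookupHeader HdrLocation (rspHeaders rsp))  -- the redirect target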
When entering this URL in a browser, it will be fine, as the browser will follow the redirect automatically. simpleHTTP, however, will not. simpleHTTP is, as the name suggests, rather simple: it does not handle things like cookies, SSL or redirects.
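For illustration, you could even follow the redirect by hand with simpleHTTP, re-requesting whatever the Location header points at. A rough sketch (followOnce is a hypothetical helper; it assumes the Location value is an absolute URL and handles only one level of redirection):

-- Follow at most one redirect manually.
followOnce :: String -> IO String
followOnce url = do
    rsp <- simpleHTTP (getRequest url) >>= either (fail . show) return
    case (rspCode rsp, lookupHeader HdrLocation (rspHeaders rsp)) of
        ((3, _, _), Just loc) -> getResponseBody =<< simpleHTTP (getRequest loc)
        _                     -> return (rspBody rsp)

The Network.Browser approach below avoids all of this bookkeeping.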
You'll want to use the Network.Browser module instead. It offers much more control over how requests are made. In particular, the setAllowRedirects function will make it follow redirects automatically.
Here's a quick and dirty function for downloading a URL into a String, with support for redirects:
import Network.Browser
import Network.HTTP (getRequest, rspBody)  -- already imported in the question's code

grabUrl :: String -> IO String
grabUrl url = fmap (rspBody . snd) . browse $ do
    -- Disable logging output
    setErrHandler $ const (return ())
    setOutHandler $ const (return ())
    setAllowRedirects True
    request $ getRequest url
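To wire this into the question's main, the second fetch can go through grabUrl instead of simpleHTTP. A sketch against the code above:

    putStrLn url
    tags_2 <- fmap parseTags $ grabUrl url  -- now follows the 302 automatically
    print tags_2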