I'm trying to parse the a links from the main part (the <article> element) of a blog post. I have adapted what I found on FPComplete, but nothing is printed out. (The code does not work as far as I can tell: running it on the online IDE, and with Bing as the target, also produces no links.)
In GHCi I can simulate the first line of parseAF, and that gets me a large record, which I take to be correct. But cursor $// findNodes &| extractData returns []. I've also tried a regex, but it wasn't happy trying to match such a long piece of text. Can anyone help?
{-# LANGUAGE OverloadedStrings #-}
module HtmlParser where
import Network.HTTP.Conduit (simpleHttp)
import Prelude hiding (concat, putStrLn)
import Data.Text (Text, concat)
import Data.Text.IO (putStrLn)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&//), (&/), (&|))
-- The URL we're going to search
url :: String
url = "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"
-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "article" &/ element "a"
-- Extract the data from each node in turn
extractData :: Cursor -> Text
extractData = concat . attribute "href"
cursorFor :: String -> IO Cursor
cursorFor u = do
    page <- simpleHttp u
    return $ fromDocument $ parseLBS page
-- Process the list of data elements
processData :: [Text] -> IO ()
processData = mapM_ putStrLn
parseAF :: IO ()
parseAF = do
    cursor <- cursorFor url
    processData $ cursor $// findNodes &| extractData
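The GHCi session mentioned above looks roughly like this (the [] is the observed result, not the expected one):

ghci> cursor <- cursorFor url
ghci> cursor $// findNodes &| extractData
[]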
UPDATE: After more exploring, it seems that the problem lies with element "article". If I replace that with element "p", which is OK in this instance as the only p elements are inside the article anyway, then I get my links. Pretty weird!
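For reference, the working variant from that update changes only findNodes; everything else stays as above:

findNodes :: Cursor -> [Cursor]
findNodes = element "p" &/ element "a"  -- presumably works because the links are immediate children of the p elements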
I think you can do this in a very readable way with HXT by composing filters:
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core
import Text.XML.HXT.Curl
import Text.XML.HXT.TagSoup
links url = extract (readDocument
    [ withParseHTML yes   -- parse as HTML rather than strict XML
    , withTagSoup         -- use the lenient TagSoup parser
    , withCurl []         -- fetch the document over HTTP via curl
    , withWarnings no     -- suppress parser warnings
    ] url)
extract doc = runX $ doc >>> xmlFilter "article" >>> xmlFilter "a" >>> toHref
xmlFilter name = deep (hasName name)
toHref = proc el -> do
    link <- getAttrValue "href" -< el
    returnA -< link
You can call this in the following way:
links "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"
OK, so the problem was that &/
only looks at immediate children, whereas &//
will go through all descendants
findNodes = element "article" &// element "a"