I've been trying to learn how to extract data from HTML files in Haskell, and have hit a wall. I'm not really experienced with Haskell at all, and my previous knowledge is from Python (and BeautifulSoup for HTML parsing).
I'm using TagSoup to look at my HTML (seemed to be recommended) and sort of have a basic idea of how it works. Here's the basic segment of my code in question (self-contained, and outputs information for testing):
import System.IO
import Network.HTTP
import Text.HTML.TagSoup
import Data.List

main :: IO ()
main = do
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
    done tags where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _ -> do
                putStrLn $ show $ head xs
                done (tail xs)
However, I'm not trying to get to any "div" tag. I want to drop everything prior to a tag in a format like this:
TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]
I've tried writing it out:
let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)
But then it tries to find the literal [0-9]+. I haven't figured out a workaround with the Text.Regex.Posix module yet, and escaping the characters doesn't work. What's the solution here?
~== does not do regular expressions, so you will have to write a matcher yourself, something along the lines of:
import Data.Maybe
import Text.HTML.TagSoup
import Text.Regex

-- True for an opening "div" tag whose id attribute contains scores-<digits>
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string
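To plug a matcher like this into the code from the question, you can pass its negation to dropWhile in place of the (~/= ...) section. Here is a minimal, self-contained sketch of that idea. It assumes the tagsoup and regex-compat packages are installed. sampleHtml is a made-up stand-in for the downloaded page, and wrapping the pattern in ^(...)$ is my own addition so that the whole id has to match rather than just a substring.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup (Tag (..), fromAttrib, parseTags, (~==))
import Text.Regex (matchRegex, mkRegex)

-- True when the whole string matches the regex.
-- The ^(...)$ anchoring is an assumption added for this sketch.
matches :: String -> String -> Bool
matches string regex =
  isJust (matchRegex (mkRegex ("^(" ++ regex ++ ")$")) string)

-- True for an opening "div" whose id looks like scores-<digits>.
-- (&&) short-circuits, so fromAttrib only ever sees an opening tag.
goodTag :: Tag String -> Bool
goodTag tag =
  (tag ~== TagOpen "div" [])
    && (fromAttrib "id" tag `matches` "scores-[0-9]+")

-- Hypothetical markup standing in for the real scoreboard page.
sampleHtml :: String
sampleHtml =
  concat
    [ "<div id=\"header\">ignored</div>"
    , "<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">first game</div>"
    ]

main :: IO ()
main = do
  -- Drop everything before the first tag accepted by goodTag.
  let tags = dropWhile (not . goodTag) (parseTags sampleHtml)
  mapM_ print tags
```

The class attribute could be checked the same way by adding another && clause that runs fromAttrib "class" tag through matches with the class pattern from the question.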