
Parsing tags with TagSoup in Haskell

I've been trying to learn how to extract data from HTML files in Haskell, and have hit a wall. I'm not really experienced with Haskell at all, and my previous knowledge is from Python (and BeautifulSoup for HTML parsing).

I'm using TagSoup to look at my HTML (it seemed to be the recommended library) and have a basic idea of how it works. Here's the relevant segment of my code (self-contained, and it prints the tags for testing):

import System.IO
import Network.HTTP
import Text.HTML.TagSoup
import Data.List

main :: IO ()
main = do
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
    done tags where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _ -> do
                putStrLn $ show $ head xs
                done (tail xs)

However, I'm not trying to get to just any "div" tag. I want to drop everything before a tag of this form:

TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I've tried writing it out:

let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)

But then it looks for the literal string [0-9]+ instead of treating it as a pattern. I haven't figured out a workaround with the Text.Regex.Posix module yet, and escaping the characters doesn't work. What's the solution here?

simonsays asked Mar 16 '13 22:03



1 Answer

~== does not do regular expressions, so you will have to write a matcher yourself, something along the lines of:

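import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (matchRegex, mkRegex)

-- True for an opening <div> whose id attribute matches the pattern.
-- fromAttrib is only evaluated once the tag is known to be a TagOpen.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string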
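
To plug a predicate like this into the dropWhile from the question, something like the following might work. It is only a sketch: isScoreBox, fromFirstScoreBox and sampleHtml are names made up for illustration, the sample markup stands in for the downloaded page, and the patterns are the ones from the question with ^ and $ added, on the assumption that the whole attribute value should match. It needs the tagsoup and regex-compat packages.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup (Tag, fromAttrib, isTagOpenName, parseTags)
import Text.Regex (matchRegex, mkRegex)

-- True when the regex matches somewhere in the string. matchRegex does not
-- anchor on its own, hence the explicit ^ and $ in the patterns below.
matches :: String -> String -> Bool
matches string regex = isJust (matchRegex (mkRegex regex) string)

-- An opening <div> whose id and class both fit the patterns from the question.
-- The && short-circuits, so fromAttrib only ever sees an opening tag.
isScoreBox :: Tag String -> Bool
isScoreBox tag =
  isTagOpenName "div" tag
    && fromAttrib "id" tag `matches` "^scores-[0-9]+$"
    && fromAttrib "class" tag `matches` "^scoreBox( spanCol[0-9]?)+( lastCol)?$"

-- Drop every tag that comes before the first score box.
fromFirstScoreBox :: String -> [Tag String]
fromFirstScoreBox = dropWhile (not . isScoreBox) . parseTags

-- Hypothetical sample markup, not the real page.
sampleHtml :: String
sampleHtml =
  concat
    [ "<div id=\"header\">NBA</div>"
    , "<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">LAL 90</div>"
    ]
```

In the original program this would replace the dropWhile line with let tags = fromFirstScoreBox http. If the class attribute can carry other values, loosening or dropping the second pattern is probably the simpler choice.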
Koterpillar answered Oct 03 '22 20:10
