In Haskell how do you extract strings from an XML document?

Tags: xml, haskell

If I have an XML document like this:

<root>
  <elem name="Greeting">
    Hello
  </elem>
  <elem name="Name">
    Name
  </elem>
</root>

and some Haskell type/data definitions like this:

 type Name = String
 type Value = String
 data LocalizedString = LS Name Value

and I wanted to write a Haskell function with the following signature:

 getLocalizedStrings :: String -> [LocalizedString]

where the first parameter was the XML text, and the returned value was:

 [LS "Greeting" "Hello", LS "Name" "Name"]

how would I do this?

If HaXml is the best tool, how would I use HaXml to achieve the above goal?

Thanks!

Tim Stewart asked Mar 17 '09 13:03


4 Answers

I've never actually bothered to figure out how to extract pieces of XML documents using HaXml; HXT has met all my needs.

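{-# LANGUAGE Arrows #-}
import Data.Char (isSpace)
import Data.Maybe (listToMaybe)
import Text.XML.HXT.Arrow  -- newer HXT releases export these names from Text.XML.HXT.Core

type Name = String
type Value = String
data LocalizedString = LS Name Value

-- Parse the XML text; Nothing means no <root> element was found.
getLocalizedStrings :: String -> Maybe [LocalizedString]
getLocalizedStrings = listToMaybe . runLA (xread >>> getRoot)

-- Select every element named `tag`, at any depth.
atTag :: ArrowXml a => String -> a XmlTree XmlTree
atTag tag = deep $ isElem >>> hasName tag

-- Collect all <elem> entries below <root> into a single list.
getRoot :: ArrowXml a => a XmlTree [LocalizedString]
getRoot = atTag "root" >>> listA getElem

-- Build one LocalizedString from an <elem>: its name attribute plus its text.
getElem :: ArrowXml a => a XmlTree LocalizedString
getElem = atTag "elem" >>> proc x -> do
    name <- getAttrValue "name" -< x
    value <- getChildren >>> getText -< x
    -- getText keeps the surrounding newlines and indentation, so trim them.
    returnA -< LS name (trim value)
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse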

You'd probably want a little more error checking (e.g. don't just lazily use atTag as I did; actually verify that <root> is the root element, that each <elem> is a direct child of it, and so on), but this works just fine on your example.
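
A stricter variant might look like the sketch below. It assumes a recent HXT where the arrows are imported from Text.XML.HXT.Core, and strictStrings is a hypothetical name of my own choosing. It only accepts <elem> elements that are direct children of a top-level <root> and that carry a name attribute, and it trims the text it finds.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import Text.XML.HXT.Core  -- assumption: HXT 9 or later

data LocalizedString = LS String String
    deriving (Eq, Show)

-- Hypothetical stricter extractor: no `deep`, so nothing is matched
-- unless it sits exactly where we expect it.
strictStrings :: String -> [LocalizedString]
strictStrings = runLA $
    xread
    >>> isElem >>> hasName "root"                       -- top-level <root> only
    >>> getChildren
    >>> isElem >>> hasName "elem" >>> hasAttr "name"    -- direct <elem name="..."> children
    >>> (getAttrValue "name" &&& (getChildren >>> getText))
    >>^ \(name, value) -> LS name (trim value)
  where
    -- Strip leading and trailing whitespace from the element text.
    trim = dropWhileEnd isSpace . dropWhile isSpace
```

An <elem> nested inside some other element is simply skipped here; whether that should instead be reported as an error depends on how strict you need to be.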


Now, if you need an introduction to Arrows, unfortunately I don't know of a good one. I learned them the "thrown into the ocean to learn how to swim" way.

Something that may be helpful to keep in mind is that the proc/-< syntax is simply sugar for the basic arrow operations (arr, >>>, etc.), just like do/<- is simply sugar for the basic monad operations (return, >>=, etc.). The following are equivalent:

getAttrValue "name" &&& (getChildren >>> getText) >>^ uncurry LS

proc x -> do
    name <- getAttrValue "name" -< x
    value <- getChildren >>> getText -< x
    returnA -< LS name value
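
To see the same equivalence without installing HXT, here is a minimal sketch that uses plain functions as the arrow (functions are an instance of Arrow in base). The Elem type and the getName/getValue helpers are hypothetical stand-ins for getAttrValue "name" and getChildren >>> getText; they are not part of HXT.

```haskell
{-# LANGUAGE Arrows #-}
import Control.Arrow (returnA, (&&&), (>>^))

-- Hypothetical stand-in for an XML element: (name attribute, text content).
type Elem = (String, String)

data LocalizedString = LS String String
    deriving (Eq, Show)

getName :: Elem -> String
getName = fst

getValue :: Elem -> String
getValue = snd

-- Point-free version built from the basic arrow combinators.
withCombinators :: Elem -> LocalizedString
withCombinators = getName &&& getValue >>^ uncurry LS

-- The same arrow written with proc/-< sugar.
withProc :: Elem -> LocalizedString
withProc = proc x -> do
    name  <- getName  -< x
    value <- getValue -< x
    returnA -< LS name value
```

Both definitions give the same result for any input, for example ("Greeting", "Hello").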
ephemient answered Nov 05 '22 13:11


Use one of the XML packages.

The most popular are, in order:

  1. haxml
  2. hxt
  3. xml-light
  4. hexpat
Don Stewart answered Nov 05 '22 15:11


FWIW, HXT seems like overkill where a simple TagSoup will do :)

ADEpt answered Nov 05 '22 15:11


Here's my second attempt (after receiving some good input from others) with TagSoup:

module Xml where

import Data.Char (isSpace)
import Text.HTML.TagSoup

type SName = String
type SValue = String

data LocalizedString = LS SName SValue
     deriving Show

getLocalizedStrings :: String -> [LocalizedString]
getLocalizedStrings = create . filterTags . parseTags
  where
    -- Keep only <elem> open tags and text nodes.
    -- (Tag is parameterised by its string type in current TagSoup releases.)
    filterTags :: [Tag String] -> [Tag String]
    filterTags = filter (\x -> isTagOpenName "elem" x || isTagText x)

    -- Pair each <elem name="..."> with the text node that directly follows it.
    create :: [Tag String] -> [LocalizedString]
    create (TagOpen "elem" [("name", name)] : TagText text : rest) =
      LS name (trimWhiteSpace text) : create rest
    create (_:rest) = create rest
    create [] = []

-- Strip leading and trailing whitespace.
trimWhiteSpace :: String -> String
trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse

main :: IO ()
main = do
  xml <- readFile "xml.xml"  -- xml.xml contains the XML from the original question.
  print (getLocalizedStrings xml)

The first attempt showcased a naive (and faulty) method for trimming whitespace from a string.
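
For what it's worth, one way to write the trim without the double reverse is dropWhileEnd from Data.List. This is only a sketch that assumes a base library recent enough to provide dropWhileEnd; trim is a hypothetical name, not something exported by TagSoup.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Remove whitespace from both ends; whitespace inside the string is kept.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace
```

Applied to the text node from the question, trim "\n    Hello\n  " gives "Hello", and interior spaces as in "  a b  " survive as "a b".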

Tim Stewart answered Nov 05 '22 13:11