How to properly parse paired HTML tags?

Question

The question is about parsing an HTML stream obtained with load/markup in such a way that you get each tag's constituent parts. That is, when you find

<div id="one">my text</div>

you should end up with something like <div id="one">, {my text} and </div> in the same container, such as

[<div id="one"> {my text} </div>]

or even better

[<div> [id {one}] {my text} </div>]
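For context, here is a quick look at the starting point. As far as I know, load/markup returns a flat block of tag! and string! values, with the attributes kept inside the opening tag, so neither of the nested shapes above comes for free. The word `block` is just an illustrative name:

```rebol
; Inspect what load/markup produces for the example input.
block: load/markup {<div id="one">my text</div>}

probe block                 ; the whole block, as Rebol sees it
print length? block         ; three values: opening tag, text, closing tag

; Show the datatype of each item next to its molded form.
foreach item block [print [type? item mold item]]
```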

The parsing problem is matching paired HTML tags. In HTML, a tag may be an empty tag, possibly with attributes but with no content and therefore no ending tag, or a normal tag, possibly with attributes, that has content and therefore an ending tag. Either way, both kinds are just a tag.

I mean, when you find a sequence like <p>some words</p> you have a P tag, just as you do with a sequence like <p />. In the first case the tag has associated text and an ending tag, and in the second it doesn't. That's all.

In other words, attributes and content are properties of the tag element in HTML, so representing this in JSON you would get something like:

{ "tag": { "name": "div", "attributes": { "id": "one" }, "content": "my text" } }

This means you have to identify the content of a tag in order to assign it to the proper tag, which in terms of Rebol parse means identifying matching tags (opening tag and ending tag).

In Rebol you can easily parse an HTML sequence like:

<div id="yo">yeah!</div><br/>

with the rule:

[ some [ tag! string! tag! | tag! ]]

But this rule matches the HTML

<div id="yo">yeah!</div><br/>

and also

<div id="yo">yeah!</p><br/>

as if they were the same.
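A minimal check of that claim, assuming the input is first turned into a block with load/markup (the word `rule` is only an illustrative name). Both calls should succeed, because the rule looks only at datatypes and never compares the two tags:

```rebol
; The loose rule: any tag, a string, any tag, or a lone tag.
rule: [some [tag! string! tag! | tag!]]

; Well-formed pair: <div ...> closed by </div>.
print parse load/markup {<div id="yo">yeah!</div><br/>} rule   ; true

; Mismatched pair: <div ...> "closed" by </p> is accepted as well.
print parse load/markup {<div id="yo">yeah!</p><br/>} rule     ; true
```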

So you need a way to match the same opening tag when it appears in ending position.

Sadly, Rebol tags cannot (AFAIK) be parametrized with the tag name, so you cannot say something like:

[ some [ set t1 tag! set s string! set t2 tag!#t1/1 | tag! ] ]

The t1/1 notation is due to a (bad) feature of Rebol: it keeps all tag attributes at the same level as the tag name. (Another bad feature is not recognizing matching tags as being the same tag.)

Of course, you can achieve the goal using code such as:

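tags: copy []
; tag! and string! rules only work on a block, so load the markup first
html: load/markup {<div id="yo">yeah!</p><br/>}
parse html [
    some [
        set t1 tag! set s string! set t2 tag! (
            tag: first make block! t1    ; tag name taken from the opening tag
            if none <> find t2 tag [append/only tags reduce [t1 s]]
        )
        | set t1 tag! (append/only tags reduce [t1])    ; capture the lone tag itself
    ]
]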

But the idea is to use a more elegant and naive approach using the parse dialect only.

Pep · Accepted Answer

There's a way to parse pairs of items in the Rebol parse dialect: simply use a word to store the expected pair:

parse ["a" "a"] [some [set s string! s ]]
parse ["a" "a" "b" "b"] [some [set s string! s ]]

But this doesn't work well with tags, because tags carry attributes and special ending marks (/), so it's not easy to derive the ending tag from the opening one:

parse [<p> "some text" </p>] [some [ set t tag! set s string! t ]]
parse [<div id="d1"> "some text" </div>] [some [ set t tag! set s string! t ]]

These don't work because </p> is not equal to <p>, and </div> is not equal to <div id="d1">.
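The inequality is easy to confirm at the console. The last line is a small sketch of what the comparison would need instead: a copy of the opening tag with "/" inserted at its head, which only works this directly when the opening tag has no attributes:

```rebol
; An opening tag never equals its closing tag.
print equal? <p> </p>                    ; false
print equal? <div id="d1"> </div>        ; false

; Building the closing form from an attribute-free opening tag does match.
print equal? </p> head insert copy <p> "/"   ; true
```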

Again, you can fix it with code:

parse load/markup "<p>preug</p>something<br />" [
    some [
        set t tag! (
            b: copy t remove/part find b " " tail b
            insert b "/"
        )
        set s string!
        b (print [t s b])
    |
        tag!
    |
        string!
    ]
]

But this is no longer simple, zen code, so the question is still alive ;-)
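One way to keep the rule itself readable is to move the tag manipulation into a helper and leave only the pairing logic in the parse rule. This is a sketch, not a pure parse-dialect solution: `closing-of` and `pair-rule` are hypothetical names, the helper assumes a single space separates the tag name from its attributes, and it is written against Rebol 2 behaviour without having been checked on Rebol 3.

```rebol
; Hypothetical helper: derive the closing tag from an opening tag,
; e.g. <div id="d1"> gives </div>.
closing-of: func [opening [tag!] /local name pos] [
    name: copy opening
    if pos: find name " " [clear pos]   ; drop attributes, if any
    head insert name "/"                ; prepend the slash, return from head
]

; Opening tag, text, then exactly the closing tag computed for it.
pair-rule: [
    set t tag! (c: closing-of t)
    set s string!
    c (print ["pair:" mold t mold s mold c])
]

; Lone tags and stray text fall through to the other alternatives.
parse load/markup {<div id="d1">some text</div><br />} [
    some [pair-rule | tag! | string!]
]
```

The paren code is still there, so the wish for a rule written in the parse dialect alone remains open. The sketch also handles only one level of text between a pair, not nested tags.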

Tags: rebol, rebol3