
Parsing html file

I want to parse an HTML file and find the numbers in a certain part of it. The goal of this script is to get one number per token, and it must find the numbers belonging to the correct IP address.

The numbers are parts of an IP address, but the address is not written in one piece; it is split across HTML tags. That's why this job is complicated. So far I have this code:

@echo off
Setlocal EnableDelayedExpansion
SET proxy_3=hide_2.htm         

FOR %%Z IN (hide_2.htm) DO (
FOR /F "tokens=1-20 delims=<>" %%A IN ('grep -B 1411 -E "</table>" %%Z ^| grep -E ^"^(display^|^^\d\d{1,3}^|country^|^<td^>HTTP^|rightborder^).*$^" ') DO (
echo A:%%A + B:%%B + C:%%C + D:%%D + %%E + %%F + %%G + %%H + %%I + %%J + %%K + %%L
FOR %%? in ( "%%~A", "%%~B", "%%~C", "%%~D", "%%~E", "%%~F", "%%~G", "%%~H", "%%~I", "%%~J") DO (
SET $=%%~?
echo $:!$!
)
pause
)
)

Here is a link to the code with color formatting: http://codepaste.net/iaf4zr

Here is the HTML source that I parse (see lines 581-585): http://codepaste.net/11bqxd (Please be patient, it takes some time to load. In case you don't want to wait, here is the source HTML without formatting: http://codepaste.net/wdkcdr)

If you want to see a shortened version, this is the relevant part (lines 581-585): http://codepaste.net/e1t61n

Now I have done some debugging:

A:          + B:td + C:span + D:span + 41 + /span + span style="display: none;"
+ 111 + /span + div +  +
$:
$:td
$:span
$:span
$:41
$:/span
$:span style="display:
$:none
$:
$:111
$:/span
$:div
Press any key to continue...
A: style="display: none;" + B:190 + C:/div + D:span class="" style="" + . + /spa
n + span + 197 + /span + span +  +
$: style="display:
$:none
$:
$:190
$:/div
$:span class="" style=""
$:.
$:/span
$:span
$:197
$:/span
$:span
Press any key to continue...
A: style="display: none;" + B:24 + C:/span + D:span + /span + . + span style="di
splay:  +  +  +  +  +
$: style="display:
$:none
$:
$:24
$:/span
$:span
$:/span
$:.
$:span style="display:
$: "" "" "
Press any key to continue...
A:inline;" + B:132 + C:/span + D:span style="display: none;" + 39 + /span + . +
span  +  +  +  +
$:inline;"" "132" "/span" "span
$:style
$:display: none;"" "39" "/span" "." "span
$: "" "

The $: prefix marks the value of the $ variable, which should be the column/token taken from the second loop, without quotes. I am looking for the number values here, without quotes. This fails in the last case.

The labels A: ... D: mark the first 4 tokens/columns; the remaining tokens are not labeled.

The part related to lines 581-585 is:

A:inline;" + B:132 + C:/span + D:span style="display: none;" + 39 + /span + . +
span + + + +
$:inline;"" "132" "/span" "span
$:style
$:display: none;"" "39" "/span" "." "span
$: "" "

If you want to see this part in colors, please see this link: http://www.dostips.com/forum/viewtopic.php?f=3&t=3435

So token B in the 2nd loop is 132, without quotes, which looks OK. But in the 3rd loop it changes to ... style.

While the 1st token in the 2nd loop is inline;", the 3rd loop shows: inline;"" "132" "/span" "span

Can you explain how this is possible? I would like to see 132 there when the 2nd member is read. I could parse the first 3 numbers successfully, but I cannot work this one out.

John Boe asked Jun 18 '26 19:06


1 Answer

Your problem is with the parsing of quotes. When the line

FOR /F "tokens=1-20 delims=<>" %%A IN 

executes, many of your variables are assigned values which contain one or more double quotes. For example, the first time through the loop, G is assigned equivalently to:

(set G=span style="display: none;")

Then in the internal loop, where you have

FOR %%? in ( "%%~A", "%%~B", "%%~C", "%%~D", "%%~E", "%%~F", "%%~G", "%%~H",...

the "%%~G" is replaced with

"span style="display: none;""

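and this gets parsed as more than one token. The first is

"span style="display:

and the rest, none;"", is split again at the semicolon into none and "" (which is why your output shows $:none followed by an empty $:).

(This happens because the " between = and display terminates the " at the start, so the space before none; becomes significant, and a semicolon outside quotes separates tokens just like a space.)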
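
The quote toggling can be modeled in a few lines of Python. This is only a sketch of the behavior described here, not cmd.exe's actual parser: the function name split_for_list is made up for illustration, and it assumes that space, tab, comma, semicolon and = separate tokens whenever they are outside double quotes. Because the model also splits at the semicolon, it prints none and "" as separate items, which matches the $:none and empty $: lines in your output.

```python
# Simplified model of how a plain FOR list is split into tokens.
# Assumption: these characters separate tokens unless they sit between
# double quotes. This is an illustration, not cmd.exe's real parser.
SEPARATORS = set(" \t,;=")


def split_for_list(text):
    """Split text the way the answer describes the plain FOR list parse."""
    tokens, current, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # every quote toggles the quoted state
            current += ch
        elif ch in SEPARATORS and not in_quotes:
            if current:  # a separator outside quotes ends the token
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


g = 'span style="display: none;"'        # value of %%G on the first pass
tokens = split_for_list('"' + g + '"')   # what "%%~G" expands to
for token in tokens:
    print(token)
```

The embedded quote after style= closes the quote that "%%~G" opened, so everything after it is scanned as unquoted text.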

Similarly, on the fourth pass through the loop (the last block of your debug output), which is where you have the problem, A through G are assigned equivalently to

set A=inline;"
set B=132
set C=/span
set D=span style="display: none;"
set E=39
set F=/span
set G=.

Now, what is easily missed is the value of H. Careful examination of the "A:..." output line shows that H is set equivalently to:

(set H=span )

or

set "H=span "

i.e. H is the string span followed by a space, and so now your inner loop

FOR %%? in ( "%%~A", "%%~B", "%%~C", "%%~D", "%%~E", "%%~F", "%%~G", "%%~H", "%%~I", "%%~J")

is equivalent to the following (the shell removes the , trailing each " before substituting the %% variables and parsing for tokens):

FOR %%? in ( "inline;"" "132" "/span" "span style="display: none;"" "39" "/span" "." "span " "" "" )

Look carefully at how that parses. "inline;" is a quoted string, then " " is a quoted string whose embedded space is not treated as a token separator. Next comes 132, still with no separating space, then " " again (embedded space, not a separator), then /span, then another " ", and finally span followed by a space that is outside quotes. So the first token becomes

set ?="inline;"" "132" "/span" "span

Next, we get an undocumented feature of the "for" parsing: an = outside of quotes is treated like a space, so the second token is

set ?=style

Then the third token starts at "display: none;", followed by " ", then 39, then " ", then /span, then " ", then ., then " ", then span, at which point we finally encounter a significant space, so

set ?="display: none;"" "39" "/span" "." "span

Then the last token is " " followed by " " followed by an unterminated ", so

set ?=" "" ""

In short, what you need to do is get rid of quotes at an appropriate spot. Fundamentally, your problem is that the first token, %%A, contains an unmatched double quote, and that completely screws up the parsing of the text line in the For loop.
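
To check the walk-through above, here is a small Python model of the inner loop. It is a sketch under stated assumptions, not cmd.exe's real parser: split_for_list, strip_outer_quotes and inner_loop are hypothetical helper names, the separator set (space, tab, comma, semicolon, =) is assumed, and the ~ modifier is modeled as dropping one leading and one trailing quote because that is what the posted output suggests. The last step only shows the effect of removing the quotes from each value before the list is built; how you do that in the batch file is up to you.

```python
# Simplified model of the inner FOR loop; an illustration only.
SEPARATORS = set(" \t,;=")  # assumed separators outside double quotes


def split_for_list(text):
    """Split text into tokens; every double quote toggles the quoted state."""
    tokens, current, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current += ch
        elif ch in SEPARATORS and not in_quotes:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def strip_outer_quotes(token):
    """Model of %%~?: drop one leading and one trailing quote if present."""
    if token.startswith('"'):
        token = token[1:]
    if token.endswith('"'):
        token = token[:-1]
    return token


def inner_loop(values):
    """Wrap each value in quotes, split the list, then apply the ~ model."""
    items = " ".join('"' + v + '"' for v in values)
    return [strip_outer_quotes(t) for t in split_for_list(items)]


# Values of %%A..%%J on the failing pass, as listed in the answer.
values = ['inline;"', '132', '/span', 'span style="display: none;"',
          '39', '/span', '.', 'span ', '', '']

broken = inner_loop(values)                                # quotes left in
fixed = inner_loop([v.replace('"', '') for v in values])   # quotes removed

for token in broken:
    print("$:" + token)
print("---")
for token in fixed:
    print("$:" + token)
```

With the quotes left in, the model yields the same four $: values as the last block of the debug output. With the quotes removed first, each value stays a separate item, and 132 and 39 come out on their own.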

David I. McIntosh answered Jun 21 '26 08:06


