
How to use xmlstarlet to extract HTML data by id

I have a simple task that has me pulling my hair out, and I'm sure I'm very close.

Here is my XHTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>Test Page</title>
</head>

<body>

<p>
test
</p>

<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>

</body>

</html>

xmlstarlet parses it without complaint:

$ xmlstarlet.exe el -v test.xhtml
html[@xmlns='http://www.w3.org/1999/xhtml']
html/head
html/head/title
html/body
html/body/p
html/body/table[@id='test_table']
html/body/table/tr
html/body/table/tr/td
html/body/table/tr/td
html/body/table/tr
html/body/table/tr/th

What I need to do is extract the data in the table tag, preferably without the HTML markup. The context is a test set in which a web page is fetched and written to a file. The test has to validate the table data but should still succeed if other things on the page change. I also won't know in advance how many columns or rows the table will have, since that varies with the data.

But when I try:

$ xmlstarlet.exe sel -t -c "/html/body/table[@id='test_table']" test.xhtml
Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
None of the XPaths matched; to match a node in the default namespace
use '_' as the prefix (see section 5.1 in the manual).
For instance, use /_:node instead of /node

Different tests need different ids, but every id value is unique. So, given any id in the XHTML, I need its data.

Thanks in advance.

matt stucky asked Feb 25 '14 17:02


2 Answers

The XHTML document declares a default namespace, which you have to bind to a prefix in the xmlstarlet command:

xmlstarlet sel \
    -N n="http://www.w3.org/1999/xhtml" \
    -t \
    -c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null

Once the <table> element is located, descendant::*/text() selects all the text nodes of its descendant elements, and 2>/dev/null suppresses the warning:

Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It yields:

testtestmo test

UPDATE: I didn't know this, but as the error message says, xmlstarlet accepts _ as a prefix for the document's default namespace, so there is no need to declare it with -N. This also works:

xmlstarlet sel \
    -t \
    -c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
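
If the test needs the cells kept apart rather than run together as in testtestmo test, one option is to do the same namespace-aware lookup outside xmlstarlet. The following is only a minimal sketch using Python's standard xml.etree.ElementTree, not an xmlstarlet feature. The function name table_rows, the SAMPLE string (a trimmed copy of the question's file) and the prefix x are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"x": XHTML}  # "x" is an arbitrary prefix chosen for this sketch

# Trimmed copy of the question's test.xhtml (assumption: same structure).
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test Page</title></head>
<body>
<p>test</p>
<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>
</body>
</html>"""


def table_rows(xhtml_text, table_id):
    """Return the cell texts of the table with the given id, one list per row."""
    root = ET.fromstring(xhtml_text)
    # The prefix is required because every element lives in the XHTML namespace.
    table = root.find(f".//x:table[@id='{table_id}']", NS)
    if table is None:
        raise LookupError(f"no table with id {table_id!r}")
    cell_tags = {f"{{{XHTML}}}td", f"{{{XHTML}}}th"}
    rows = []
    for tr in table.iter(f"{{{XHTML}}}tr"):
        # Works for any number of rows and columns.
        rows.append(["".join(c.itertext()).strip() for c in tr if c.tag in cell_tags])
    return rows


print(table_rows(SAMPLE, "test_table"))  # [['test', 'test'], ['mo test']]
```

Returning a list per row lets a test compare the table contents directly while ignoring everything else on the page.
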
Birei answered Sep 28 '22 04:09


As mentioned under common problems in the xmlstarlet user guide (http://xmlstar.sourceforge.net/doc/UG/ch05.html), when you use the -N x="http://www.w3.org/1999/xhtml" option you also have to prefix the node names in your selections with x:, e.g.

 xmlstarlet sel \
  -N x="http://www.w3.org/1999/xhtml" \
  -t \
  -m "//x:pre" \
  -v . somehtml.html

This matches every pre element and prints its text value.
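
The same rule (a declared prefix must also appear on every node name) is not specific to xmlstarlet. As a rough illustration only, here is how it looks with Python's standard xml.etree.ElementTree; the tiny DOC string and the variable names are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-<pre> document in the XHTML default namespace.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><pre>one</pre><pre>two</pre></body></html>"
)
ns = {"x": "http://www.w3.org/1999/xhtml"}  # prefix "x" is arbitrary

root = ET.fromstring(DOC)

# Without the prefix nothing matches, because the elements are namespaced.
unprefixed = root.findall(".//pre")

# With the prefix bound to the XHTML namespace, both elements are found.
prefixed = root.findall(".//x:pre", ns)
pre_texts = [el.text for el in prefixed]

print(len(unprefixed), pre_texts)  # 0 ['one', 'two']
```
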

Wolfgang Fahl answered Sep 28 '22 06:09