how to? xmlstarlet to extract HTML data by id

Question

I have a simple task that has me pulling my hair out, i'm sure i'm very close.

here is my xhtml file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>Test Page</title>
</head>

<body>

<p>
test
</p>

<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>

</body>

</html>

... and xmlstarlet likes it:

$ xmlstarlet.exe el -v test.xhtml
html[@xmlns='http://www.w3.org/1999/xhtml']
html/head
html/head/title
html/body
html/body/p
html/body/table[@id='test_table']
html/body/table/tr
html/body/table/tr/td
html/body/table/tr/td
html/body/table/tr
html/body/table/tr/th

what i need to do is extract the data in the table tag, preferably without the HTML. the context for this is i am writing a test set where a web page is called then written to file. the test requires me to validate the table data but allow the test to succeed if other things on the page change. Also, i will not know in advance how many columns or rows the table will have, it can vary based on the data.

but when i try:

$ xmlstarlet.exe sel -t -c "/html/body/table[@id='test_table']" test.xhtml
Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
None of the XPaths matched; to match a node in the default namespace
use '_' as the prefix (see section 5.1 in the manual).
For instance, use /_:node instead of /node

there are different id's i need for different tests, but they all have unique id values. so, given any 'id' in xhthml, i need it's data.

thanks in advance.

Birei · Accepted Answer

The html data has a default namespace that you have to declare in the xmlstarlet command:

xmlstarlet sel \
    -N n="http://www.w3.org/1999/xhtml" \
    -t \
    -c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null

Once located the <table> element I use descendant::*/text() to extract all text elements of it, and also use 2>/dev/null to skip the warning:

Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It yields:

testtestmo test

UPDATE: I didn't know it but as the error message says, there is no need to declare the namespace when it's the default one, so also this works:

xmlstarlet sel \
    -t \
    -c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null

Wolfgang Fahl · Answer

As is mentioned in

http://xmlstar.sourceforge.net/doc/UG/ch05.html

common problems when using the

-N x="http://www.w3.org/1999/xhtml" \

option you also have to prefix the node selections with

x:

e.g.

 xmlstarlet sel \
  -N x="http://www.w3.org/1999/xhtml" \
  -t \
  -m "//x:pre" \
  -v . somehtml.html

will select all pre nodes

how to? xmlstarlet to extract HTML data by id

Tags:

xml

xhtml

xmlstarlet

matt stucky

2 Answers

Birei

Wolfgang Fahl

Recent Activity

Donate For Us

how to? xmlstarlet to extract HTML data by id

Tags:

xml

xhtml

xmlstarlet

matt stucky

2 Answers

Birei

Wolfgang Fahl

Related questions

Recent Activity

Donate For Us