
Using xmllint to parse an HTML file

On a Mac, I was trying to extract the text between specific tags in various HTML files. I was looking for the first <H1> heading in the body. Example:

<BODY>
<H1>Dublin</H1>

I believe using regular expressions for this is an anti-pattern, so I used xmllint and XPath instead.

xmllint --nowarning --xpath '/HTML/BODY/H1[0]'

The problem is that some of the HTML files contain badly formed tags, so I get errors along the lines of

 parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>

I can't just add 2>/dev/null, because then I lose those files altogether. Is there any way to use an XPath expression here and tell xmllint to relax if the XML isn't perfect, and just give me the value between the first H1 tags?

More Than Five asked Mar 08 '17 19:03




1 Answer

Try the --html option. Without it, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based (so h1[1], not H1[0]) and that HTML tag names are converted to lowercase when parsing. The command

xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF

prints

<h1>Dublin</h1>
nwellnhof answered Sep 20 '22 22:09
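As a follow-up to the answer above, here is a minimal sketch of how the same command might be wrapped to print only the heading text rather than the <h1> element. It assumes a reasonably recent xmllint that supports --xpath with string-valued expressions. The function name first_h1 and the sample file contents are hypothetical and used only for illustration. With --html the parser recovers from mismatched tags, so redirecting stderr should hide the diagnostics without losing the result on stdout.

```shell
# Hypothetical helper: print the text of the first <h1> in an HTML file.
first_h1() {
    # --html selects the lenient HTML parser instead of the XML parser.
    # string(...) returns the text content of the first matching node.
    # 2>/dev/null hides parser diagnostics; the result goes to stdout.
    xmllint --html --xpath 'string(/html/body/h1[1])' "$1" 2>/dev/null
}

# Demo on a deliberately malformed sample file (made-up content).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<BODY>
<H1>Dublin</H1>
<UL><LI>one</UL></LI>
EOF

heading=$(first_h1 "$sample")
echo "$heading"
rm -f "$sample"
```

To run this over many files, something like for f in *.html; do printf '%s: %s\n' "$f" "$(first_h1 "$f")"; done should work, again under the assumptions above.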