 

Parsing HTML on the command line: how to capture the text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" to get the lines with the targets; that works fine.

Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: only the first word is preserved, as below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either in my awk or with a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

asked Sep 11 '13 by Michael J

2 Answers

Try pup, a command-line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target1NoSpaces
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
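If you'd rather stay within the question's own grep toolbox, GNU grep's PCRE mode (-P) can match just the tag contents with lookarounds. This is a quick-and-dirty sketch that, like the original pipeline, assumes each <strong>…</strong> pair sits on a single line:

```shell
# Recreate the sample input from the question.
cat > output.html <<'EOF'
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
EOF

# -o prints only the matched text; the lookbehind/lookahead keep the
# <strong> and </strong> tags themselves out of the match.
grep -oP '(?<=<strong>)[^<]+(?=</strong>)' output.html
# Target1NoSpaces
# Target2 With Spaces
```

Note that -P is a GNU extension; BSD/macOS grep does not support it.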

answered Oct 03 '22 by kenorb


One way, using Mojolicious and its DOM parser:

perl -Mojo -E '
    g("http://your.web")
    ->dom
    ->find("strong")
    ->each( sub { if ( $t = shift->text ) { say $t } } )'
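In the same quick-and-dirty spirit the question asks for, a sed substitution also handles the sample input. This sketch assumes one <strong>…</strong> pair per line with no nested markup inside it:

```shell
# Recreate the sample input from the question.
cat > output.html <<'EOF'
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
EOF

# Capture what sits between <strong> and </strong>, and print (-n ... p)
# only the lines where the substitution actually matched.
sed -n 's/.*<strong>\([^<]*\)<\/strong>.*/\1/p' output.html
# Target1NoSpaces
# Target2 With Spaces
```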
answered Oct 03 '22 by Birei