 

Parsing HTML on the command line: how to capture the text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" to get the lines with the targets; that works fine.

Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: only the first word is preserved, as below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either in my awk or with a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

asked Sep 11 '13 by Michael J

2 Answers

Try pup, a command-line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target1NoSpaces
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
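If you'd rather stay within the question's own grep toolbox, GNU grep's PCRE mode (-P) can match just the tag contents with lookarounds. This is a quick-and-dirty sketch that, like the original pipeline, assumes each <strong>…</strong> pair sits on a single line:

```shell
# Recreate the sample input from the question.
cat > output.html <<'EOF'
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
EOF

# -o prints only the matched text; the lookbehind/lookahead keep the
# <strong> and </strong> tags themselves out of the match.
grep -oP '(?<=<strong>)[^<]+(?=</strong>)' output.html
# Target1NoSpaces
# Target2 With Spaces
```

Note that -P is a GNU extension; BSD/macOS grep does not support it.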

answered Oct 03 '22 by kenorb


One way, using Mojolicious and its DOM parser:

perl -Mojo -E '
    g("http://your.web")
    ->dom
    ->find("strong")
    ->each( sub { if ( $t = shift->text ) { say $t } } )'
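In the same quick-and-dirty spirit the question asks for, a sed substitution also handles the sample input. This sketch assumes one <strong>…</strong> pair per line with no nested markup inside it:

```shell
# Recreate the sample input from the question.
cat > output.html <<'EOF'
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
EOF

# Capture what sits between <strong> and </strong>, and print (-n ... p)
# only the lines where the substitution actually matched.
sed -n 's/.*<strong>\([^<]*\)<\/strong>.*/\1/p' output.html
# Target1NoSpaces
# Target2 With Spaces
```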
answered Oct 03 '22 by Birei