Is there something like a "CSS selector" or XPath grep?

I need to find all places in a set of HTML files that match the following structure (CSS):

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places within them) that match this criterion? That is, a tool that returns the file names of files matching a certain HTML or XML structure.

asked Sep 07 '11 13:09

Boldewyn

1 Answer

Try this:

1. Install HTML-XML-utils (http://www.w3.org/Tools/HTML-XML-utils/):
   - Ubuntu: aptitude install html-xml-utils
   - macOS: brew install html-xml-utils
2. Save a web page (call it filename.html).
3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Here, "label.black" is the CSS selector that identifies the HTML elements to extract. To avoid retyping the pipeline, write a helper script named cssgrep:

#!/bin/bash

# Usage: cssgrep FILE SELECTOR
# Ignore errors, write the results to standard output.
# "$1" is quoted so that file names containing spaces still work.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This prints the content of every HTML label element with the class black.
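The question also asks for the names of matching files rather than the matched content. A minimal sketch of that, in the style of grep -l, could look like the following. The function name cssgrep_files is hypothetical and not part of html-xml-utils, and the sketch assumes that hxselect prints nothing at all when no element matches the selector. The -c flag is dropped here so that a matching element with empty content still produces output.

```shell
# Hypothetical helper (not part of html-xml-utils): print the name of every
# file that contains at least one element matching a CSS selector.
# Usage: cssgrep_files SELECTOR FILE...
cssgrep_files() {
  local selector=$1 file
  shift
  for file in "$@"; do
    # Same pipeline as cssgrep, but without -c, so each match is printed
    # as a whole element. Assumption: no match means no output, so any
    # non-empty line signals that the selector matched in this file.
    if hxnormalize -l 240 -x "$file" 2>/dev/null |
         hxselect -s '\n' "$selector" 2>/dev/null |
         grep -q .; then
      printf '%s\n' "$file"
    fi
  done
}
```

After sourcing the function in bash, something like cssgrep_files 'div.a ul.b' *.html should list only the files that contain the structure from the question.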

The -l 240 argument sets the maximum line length of the normalized output to 240 characters, which keeps stray line breaks out of the extracted text. For example, if the input is <label class="black">Text to \nextract</label>, then hxnormalize reformats it to <label class="black">Text to extract</label> and only inserts newlines at column 240, which simplifies parsing. A larger limit, such as 1024 or beyond, also works.

See also:

- https://superuser.com/a/529024/9067 - similar question
- https://gist.github.com/Boldewyn/4473790 - wrapper script

Dave Jarvis
