
Is there something like a "CSS selector" or XPath grep?


I need to find all places in a bunch of HTML files that match the following structure (CSS):

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places within them) that match this criterion? That is, a tool that returns a file's name if the file matches a certain HTML or XML structure.

Boldewyn asked Sep 07 '11 13:09




1 Answer

Try this:

  1. Install html-xml-utils (http://www.w3.org/Tools/HTML-XML-utils/).
    • Ubuntu: aptitude install html-xml-utils
    • macOS: brew install html-xml-utils
  2. Save a web page (call it filename.html).
  3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Here "label.black" is the CSS selector that identifies the HTML elements to extract. To reuse the command, write a helper script named cssgrep:

#!/bin/bash

# Discard hxnormalize's error messages; write the matches to standard output.
# Quote "$1" so that file names containing spaces still work.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This prints the content of all HTML label elements with the class black.
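To get file names rather than element content, as the question asks, the same pipeline can be wrapped in a loop over several files. This is a minimal sketch and not part of the original answer: the function name cssgrep_files is hypothetical, and it assumes that hxselect prints nothing when the selector matches no element.

```shell
#!/bin/bash

# Hypothetical helper: print the name of every file that contains at least
# one element matching a CSS selector.
# Usage: cssgrep_files SELECTOR FILE...
cssgrep_files() {
  local selector=$1
  shift
  local file
  for file in "$@"; do
    # Without -c, hxselect prints each matched element in full, so even an
    # empty element yields output. Assumption: no match means empty output.
    if [ -n "$(hxnormalize -x < "$file" 2>/dev/null | hxselect -s '\n' "$selector" 2>/dev/null)" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

After sourcing the file that defines the function, a call such as cssgrep_files 'div.a ul.b' *.html would list only the files in which the selector matches.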

The -l 240 argument matters because the output is split into one match per line: hxnormalize wraps its output at the given line length, and a short limit could break an element's text across several lines. For example, if the input is <label class="black">Text to \nextract</label>, then -l 240 reformats it to <label class="black">Text to extract</label> on a single line, inserting newlines only at column 240, which simplifies parsing. A larger limit, such as 1024 or beyond, also works.

See also:

  • https://superuser.com/a/529024/9067 - similar question
  • https://gist.github.com/Boldewyn/4473790 - wrapper script
Dave Jarvis answered Oct 02 '22 19:10